%%html
<div style="background-color: #f0f0f5; padding: 20px; border-radius: 10px; text-align: center;">
<h1 style="color: #3333cc;">Data Preparation</h1>
</div>
Data Preparation
Introduction¶
Importing libraries¶
# Core libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
import os
Changing the default table display and rounding settings¶
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', None)
pd.set_option('display.float_format', '{:.2f}'.format)
Loading the data¶
Data source: https://www.kaggle.com/c/home-credit-default-risk/data
application = pd.read_csv("application_train.csv")
bureau = pd.read_csv("bureau.csv")
bureau_balance = pd.read_csv("bureau_balance.csv")
credit_card_balance = pd.read_csv("credit_card_balance.csv")
installments_payments = pd.read_csv("installments_payments.csv")
POS_CASH_balance = pd.read_csv("POS_CASH_balance.csv")
previous_application = pd.read_csv("previous_application.csv")
Dataset description
- the data come from 7 separate files that cover several sources of information, which makes them a good basis for building a robust and accurate scoring model
- the main table is the table of client applications (application)
- joined to it is information from the credit bureau (bureau), with detailed data on the borrower's previous loans granted by other banks
- also attached is information on the applicants' previous applications, including POS and cash loans (POS_CASH_balance), installment repayments, and credit cards held
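The relationships described above can be sketched on toy data. This is only an illustration under stated assumptions: the frames `application_demo` and `bureau_demo`, their values, and the aggregate column names `BUREAU_LOAN_COUNT` and `BUREAU_CREDIT_SUM` are hypothetical, while the key columns `SK_ID_CURR` and `SK_ID_BUREAU` come from the tables loaded above. The idea is that a child table with many rows per client is first collapsed to one row per `SK_ID_CURR` and only then joined to the main table.

```python
import pandas as pd

# Toy stand-ins for the real tables (all values are made up for illustration).
application_demo = pd.DataFrame({"SK_ID_CURR": [1, 2, 3], "TARGET": [0, 1, 0]})
bureau_demo = pd.DataFrame({
    "SK_ID_CURR": [1, 1, 3],
    "SK_ID_BUREAU": [10, 11, 12],
    "AMT_CREDIT_SUM": [1000.0, 500.0, 200.0],
})

# Collapse bureau to one row per client before joining, so the
# application table keeps exactly one row per SK_ID_CURR.
bureau_agg = (
    bureau_demo.groupby("SK_ID_CURR")
    .agg(
        BUREAU_LOAN_COUNT=("SK_ID_BUREAU", "count"),
        BUREAU_CREDIT_SUM=("AMT_CREDIT_SUM", "sum"),
    )
    .reset_index()
)

# Left join: clients without any bureau history get NaN in the new columns.
merged_demo = application_demo.merge(bureau_agg, on="SK_ID_CURR", how="left")
print(merged_demo)
```

Joining the raw child table directly would duplicate application rows for clients with several bureau records, which is why the aggregation step comes first in this sketch.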
Inspecting the data¶
Data size¶
print(f"Shape of table application: {application.shape}")
print(f"Shape of table bureau: {bureau.shape}")
print(f"Shape of table bureau_balance: {bureau_balance.shape}")
print(f"Shape of table previous_application: {previous_application.shape}")
print(f"Shape of table POS_CASH_balance: {POS_CASH_balance.shape}")
print(f"Shape of table installments_payments: {installments_payments.shape}")
print(f"Shape of table credit_card_balance: {credit_card_balance.shape}")
Shape of table application: (307511, 122)
Shape of table bureau: (1716428, 17)
Shape of table bureau_balance: (27299925, 3)
Shape of table previous_application: (1670214, 37)
Shape of table POS_CASH_balance: (10001358, 8)
Shape of table installments_payments: (13605401, 8)
Shape of table credit_card_balance: (3840312, 23)
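With seven tables, the same overview can be collected in a loop instead of one print call per table. The sketch below is a possible variant, not the approach used in this notebook: `tables_demo` is a hypothetical dictionary of toy frames, and the `memory_mb` column is an extra that may be useful given that some of the real tables have tens of millions of rows.

```python
import pandas as pd

# Toy frames standing in for the real tables (values are made up).
tables_demo = {
    "application": pd.DataFrame({"SK_ID_CURR": [1, 2, 3], "TARGET": [0, 1, 0]}),
    "bureau": pd.DataFrame({"SK_ID_CURR": [1, 1], "SK_ID_BUREAU": [10, 11]}),
}

# One row per table: number of rows, number of columns, memory footprint in MB.
overview = pd.DataFrame({
    "rows": {name: df.shape[0] for name, df in tables_demo.items()},
    "columns": {name: df.shape[1] for name, df in tables_demo.items()},
    "memory_mb": {
        name: df.memory_usage(deep=True).sum() / 1e6
        for name, df in tables_demo.items()
    },
})
print(overview)
```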
application.head(1)
| | SK_ID_CURR | TARGET | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | AMT_GOODS_PRICE | NAME_TYPE_SUITE | NAME_INCOME_TYPE | NAME_EDUCATION_TYPE | NAME_FAMILY_STATUS | NAME_HOUSING_TYPE | REGION_POPULATION_RELATIVE | DAYS_BIRTH | DAYS_EMPLOYED | DAYS_REGISTRATION | DAYS_ID_PUBLISH | OWN_CAR_AGE | FLAG_MOBIL | FLAG_EMP_PHONE | FLAG_WORK_PHONE | FLAG_CONT_MOBILE | FLAG_PHONE | FLAG_EMAIL | OCCUPATION_TYPE | CNT_FAM_MEMBERS | REGION_RATING_CLIENT | REGION_RATING_CLIENT_W_CITY | WEEKDAY_APPR_PROCESS_START | HOUR_APPR_PROCESS_START | REG_REGION_NOT_LIVE_REGION | REG_REGION_NOT_WORK_REGION | LIVE_REGION_NOT_WORK_REGION | REG_CITY_NOT_LIVE_CITY | REG_CITY_NOT_WORK_CITY | LIVE_CITY_NOT_WORK_CITY | ORGANIZATION_TYPE | EXT_SOURCE_1 | EXT_SOURCE_2 | EXT_SOURCE_3 | APARTMENTS_AVG | BASEMENTAREA_AVG | YEARS_BEGINEXPLUATATION_AVG | YEARS_BUILD_AVG | COMMONAREA_AVG | ELEVATORS_AVG | ENTRANCES_AVG | FLOORSMAX_AVG | FLOORSMIN_AVG | LANDAREA_AVG | LIVINGAPARTMENTS_AVG | LIVINGAREA_AVG | NONLIVINGAPARTMENTS_AVG | NONLIVINGAREA_AVG | APARTMENTS_MODE | BASEMENTAREA_MODE | YEARS_BEGINEXPLUATATION_MODE | YEARS_BUILD_MODE | COMMONAREA_MODE | ELEVATORS_MODE | ENTRANCES_MODE | FLOORSMAX_MODE | FLOORSMIN_MODE | LANDAREA_MODE | LIVINGAPARTMENTS_MODE | LIVINGAREA_MODE | NONLIVINGAPARTMENTS_MODE | NONLIVINGAREA_MODE | APARTMENTS_MEDI | BASEMENTAREA_MEDI | YEARS_BEGINEXPLUATATION_MEDI | YEARS_BUILD_MEDI | COMMONAREA_MEDI | ELEVATORS_MEDI | ENTRANCES_MEDI | FLOORSMAX_MEDI | FLOORSMIN_MEDI | LANDAREA_MEDI | LIVINGAPARTMENTS_MEDI | LIVINGAREA_MEDI | NONLIVINGAPARTMENTS_MEDI | NONLIVINGAREA_MEDI | FONDKAPREMONT_MODE | HOUSETYPE_MODE | TOTALAREA_MODE | WALLSMATERIAL_MODE | EMERGENCYSTATE_MODE | OBS_30_CNT_SOCIAL_CIRCLE | DEF_30_CNT_SOCIAL_CIRCLE | OBS_60_CNT_SOCIAL_CIRCLE | DEF_60_CNT_SOCIAL_CIRCLE | DAYS_LAST_PHONE_CHANGE | FLAG_DOCUMENT_2 | FLAG_DOCUMENT_3 | FLAG_DOCUMENT_4 | FLAG_DOCUMENT_5 | FLAG_DOCUMENT_6 | FLAG_DOCUMENT_7 | FLAG_DOCUMENT_8 | FLAG_DOCUMENT_9 | FLAG_DOCUMENT_10 | FLAG_DOCUMENT_11 | FLAG_DOCUMENT_12 | FLAG_DOCUMENT_13 | FLAG_DOCUMENT_14 | FLAG_DOCUMENT_15 | FLAG_DOCUMENT_16 | FLAG_DOCUMENT_17 | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100002 | 1 | Cash loans | M | N | Y | 0 | 202500.00 | 406597.50 | 24700.50 | 351000.00 | Unaccompanied | Working | Secondary / secondary special | Single / not married | House / apartment | 0.02 | -9461 | -637 | -3648.00 | -2120 | NaN | 1 | 1 | 0 | 1 | 1 | 0 | Laborers | 1.00 | 2 | 2 | WEDNESDAY | 10 | 0 | 0 | 0 | 0 | 0 | 0 | Business Entity Type 3 | 0.08 | 0.26 | 0.14 | 0.02 | 0.04 | 0.97 | 0.62 | 0.01 | 0.00 | 0.07 | 0.08 | 0.12 | 0.04 | 0.02 | 0.02 | 0.00 | 0.00 | 0.03 | 0.04 | 0.97 | 0.63 | 0.01 | 0.00 | 0.07 | 0.08 | 0.12 | 0.04 | 0.02 | 0.02 | 0.00 | 0.00 | 0.03 | 0.04 | 0.97 | 0.62 | 0.01 | 0.00 | 0.07 | 0.08 | 0.12 | 0.04 | 0.02 | 0.02 | 0.00 | 0.00 | reg oper account | block of flats | 0.01 | Stone, brick | No | 2.00 | 2.00 | 2.00 | 2.00 | -1134.00 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 |
# Count the occurrences of each target value and its percentage share in the dataset
value_counts = application['TARGET'].value_counts()
percentage = (value_counts / len(application)) * 100
# Build a summary DataFrame
summary_table = pd.DataFrame({
    'Value': value_counts.index,
    'Count': value_counts.values,
    'Percentage (%)': percentage.values
})
print(summary_table)
   Value   Count  Percentage (%)
0      0  282686           91.93
1      1   24825            8.07
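The output above shows that only about 8% of applications have TARGET equal to 1, so the classes are strongly imbalanced. A compact way to get the same summary, plus a single imbalance ratio, is sketched below. It assumes a toy series `target_demo` in place of `application['TARGET']`; the name `imbalance_ratio` is introduced here only for illustration.

```python
import pandas as pd

# Toy target column; in the notebook this would be application['TARGET'].
target_demo = pd.Series([0, 0, 0, 1, 0, 0, 0, 0, 1, 0], name="TARGET")

# value_counts(normalize=True) returns shares directly, so no manual division is needed.
summary_demo = pd.DataFrame({
    "Count": target_demo.value_counts(),
    "Percentage (%)": target_demo.value_counts(normalize=True) * 100,
})
summary_demo.index.name = "Value"

# Ratio of class 0 to class 1, a quick gauge of class imbalance.
imbalance_ratio = summary_demo.loc[0, "Count"] / summary_demo.loc[1, "Count"]
print(summary_demo)
print(f"Imbalance ratio (0:1): {imbalance_ratio:.1f}")
```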
bureau.head(1)
| | SK_ID_CURR | SK_ID_BUREAU | CREDIT_ACTIVE | CREDIT_CURRENCY | DAYS_CREDIT | CREDIT_DAY_OVERDUE | DAYS_CREDIT_ENDDATE | DAYS_ENDDATE_FACT | AMT_CREDIT_MAX_OVERDUE | CNT_CREDIT_PROLONG | AMT_CREDIT_SUM | AMT_CREDIT_SUM_DEBT | AMT_CREDIT_SUM_LIMIT | AMT_CREDIT_SUM_OVERDUE | CREDIT_TYPE | DAYS_CREDIT_UPDATE | AMT_ANNUITY |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 215354 | 5714462 | Closed | currency 1 | -497 | 0 | -153.00 | -153.00 | NaN | 0 | 91323.00 | 0.00 | NaN | 0.00 | Consumer credit | -131 | NaN |
bureau_balance.head(1)
| | SK_ID_BUREAU | MONTHS_BALANCE | STATUS |
|---|---|---|---|
| 0 | 5715448 | 0 | C |
previous_application.head(1)
| | SK_ID_PREV | SK_ID_CURR | NAME_CONTRACT_TYPE | AMT_ANNUITY | AMT_APPLICATION | AMT_CREDIT | AMT_DOWN_PAYMENT | AMT_GOODS_PRICE | WEEKDAY_APPR_PROCESS_START | HOUR_APPR_PROCESS_START | FLAG_LAST_APPL_PER_CONTRACT | NFLAG_LAST_APPL_IN_DAY | RATE_DOWN_PAYMENT | RATE_INTEREST_PRIMARY | RATE_INTEREST_PRIVILEGED | NAME_CASH_LOAN_PURPOSE | NAME_CONTRACT_STATUS | DAYS_DECISION | NAME_PAYMENT_TYPE | CODE_REJECT_REASON | NAME_TYPE_SUITE | NAME_CLIENT_TYPE | NAME_GOODS_CATEGORY | NAME_PORTFOLIO | NAME_PRODUCT_TYPE | CHANNEL_TYPE | SELLERPLACE_AREA | NAME_SELLER_INDUSTRY | CNT_PAYMENT | NAME_YIELD_GROUP | PRODUCT_COMBINATION | DAYS_FIRST_DRAWING | DAYS_FIRST_DUE | DAYS_LAST_DUE_1ST_VERSION | DAYS_LAST_DUE | DAYS_TERMINATION | NFLAG_INSURED_ON_APPROVAL |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2030495 | 271877 | Consumer loans | 1730.43 | 17145.00 | 17145.00 | 0.00 | 17145.00 | SATURDAY | 15 | Y | 1 | 0.00 | 0.18 | 0.87 | XAP | Approved | -73 | Cash through the bank | XAP | NaN | Repeater | Mobile | POS | XNA | Country-wide | 35 | Connectivity | 12.00 | middle | POS mobile with interest | 365243.00 | -42.00 | 300.00 | -42.00 | -37.00 | 0.00 |
POS_CASH_balance.head(1)
| | SK_ID_PREV | SK_ID_CURR | MONTHS_BALANCE | CNT_INSTALMENT | CNT_INSTALMENT_FUTURE | NAME_CONTRACT_STATUS | SK_DPD | SK_DPD_DEF |
|---|---|---|---|---|---|---|---|---|
| 0 | 1803195 | 182943 | -31 | 48.00 | 45.00 | Active | 0 | 0 |
installments_payments.head(1)
| | SK_ID_PREV | SK_ID_CURR | NUM_INSTALMENT_VERSION | NUM_INSTALMENT_NUMBER | DAYS_INSTALMENT | DAYS_ENTRY_PAYMENT | AMT_INSTALMENT | AMT_PAYMENT |
|---|---|---|---|---|---|---|---|---|
| 0 | 1054186 | 161674 | 1.00 | 6 | -1180.00 | -1187.00 | 6948.36 | 6948.36 |
credit_card_balance.head(1)
| | SK_ID_PREV | SK_ID_CURR | MONTHS_BALANCE | AMT_BALANCE | AMT_CREDIT_LIMIT_ACTUAL | AMT_DRAWINGS_ATM_CURRENT | AMT_DRAWINGS_CURRENT | AMT_DRAWINGS_OTHER_CURRENT | AMT_DRAWINGS_POS_CURRENT | AMT_INST_MIN_REGULARITY | AMT_PAYMENT_CURRENT | AMT_PAYMENT_TOTAL_CURRENT | AMT_RECEIVABLE_PRINCIPAL | AMT_RECIVABLE | AMT_TOTAL_RECEIVABLE | CNT_DRAWINGS_ATM_CURRENT | CNT_DRAWINGS_CURRENT | CNT_DRAWINGS_OTHER_CURRENT | CNT_DRAWINGS_POS_CURRENT | CNT_INSTALMENT_MATURE_CUM | NAME_CONTRACT_STATUS | SK_DPD | SK_DPD_DEF |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2562384 | 378907 | -6 | 56.97 | 135000 | 0.00 | 877.50 | 0.00 | 877.50 | 1700.33 | 1800.00 | 1800.00 | 0.00 | 0.00 | 0.00 | 0.00 | 1 | 0.00 | 1.00 | 35.00 | Active | 0 | 0 |
Data preparation, part I¶
The "application" table¶
Checking the basic characteristics of the dataset¶
application.shape
(307511, 122)
results = []
total_rows = len(application)  # number of rows in the table
# Iterate over every column
for column in application:
    unique_values = application[column].nunique()  # number of unique values in the column (NaN not counted)
    data_type = application[column].dtype  # data type
    null_count = application[column].isnull().sum()  # number of nulls
    null_percent = (null_count / total_rows) * 100  # % of nulls
    # Conditions that decide how the unique values are displayed
    # With at most 5 unique values, show them regardless of the data type
    if unique_values <= 5:
        values_to_display = application[column].unique()
    # Non-numeric columns show all unique values, even above 5
    elif application[column].dtype == 'object':
        values_to_display = application[column].unique()
    # Numeric columns with more than 5 unique values show the message below
    else:
        values_to_display = "> 5 unique numeric values"
    # Append the results to the list
    results.append({
        'Variable': column,
        'Data Type': data_type,
        'Unique Values Count': unique_values,
        'Unique Values': values_to_display,
        'Null Values Count': null_count,
        'Null Values %': null_percent
    })
# Convert the results to a DataFrame
results_df = pd.DataFrame(results)
results_df
| | Variable | Data Type | Unique Values Count | Unique Values | Null Values Count | Null Values % |
|---|---|---|---|---|---|---|
| 0 | SK_ID_CURR | int64 | 307511 | > 5 unique numeric values | 0 | 0.00 |
| 1 | TARGET | int64 | 2 | [1, 0] | 0 | 0.00 |
| 2 | NAME_CONTRACT_TYPE | object | 2 | [Cash loans, Revolving loans] | 0 | 0.00 |
| 3 | CODE_GENDER | object | 3 | [M, F, XNA] | 0 | 0.00 |
| 4 | FLAG_OWN_CAR | object | 2 | [N, Y] | 0 | 0.00 |
| 5 | FLAG_OWN_REALTY | object | 2 | [Y, N] | 0 | 0.00 |
| 6 | CNT_CHILDREN | int64 | 15 | > 5 unique numeric values | 0 | 0.00 |
| 7 | AMT_INCOME_TOTAL | float64 | 2548 | > 5 unique numeric values | 0 | 0.00 |
| 8 | AMT_CREDIT | float64 | 5603 | > 5 unique numeric values | 0 | 0.00 |
| 9 | AMT_ANNUITY | float64 | 13672 | > 5 unique numeric values | 12 | 0.00 |
| 10 | AMT_GOODS_PRICE | float64 | 1002 | > 5 unique numeric values | 278 | 0.09 |
| 11 | NAME_TYPE_SUITE | object | 7 | [Unaccompanied, Family, Spouse, partner, Children, Other_A, nan, Other_B, Group of people] | 1292 | 0.42 |
| 12 | NAME_INCOME_TYPE | object | 8 | [Working, State servant, Commercial associate, Pensioner, Unemployed, Student, Businessman, Maternity leave] | 0 | 0.00 |
| 13 | NAME_EDUCATION_TYPE | object | 5 | [Secondary / secondary special, Higher education, Incomplete higher, Lower secondary, Academic degree] | 0 | 0.00 |
| 14 | NAME_FAMILY_STATUS | object | 6 | [Single / not married, Married, Civil marriage, Widow, Separated, Unknown] | 0 | 0.00 |
| 15 | NAME_HOUSING_TYPE | object | 6 | [House / apartment, Rented apartment, With parents, Municipal apartment, Office apartment, Co-op apartment] | 0 | 0.00 |
| 16 | REGION_POPULATION_RELATIVE | float64 | 81 | > 5 unique numeric values | 0 | 0.00 |
| 17 | DAYS_BIRTH | int64 | 17460 | > 5 unique numeric values | 0 | 0.00 |
| 18 | DAYS_EMPLOYED | int64 | 12574 | > 5 unique numeric values | 0 | 0.00 |
| 19 | DAYS_REGISTRATION | float64 | 15688 | > 5 unique numeric values | 0 | 0.00 |
| 20 | DAYS_ID_PUBLISH | int64 | 6168 | > 5 unique numeric values | 0 | 0.00 |
| 21 | OWN_CAR_AGE | float64 | 62 | > 5 unique numeric values | 202929 | 65.99 |
| 22 | FLAG_MOBIL | int64 | 2 | [1, 0] | 0 | 0.00 |
| 23 | FLAG_EMP_PHONE | int64 | 2 | [1, 0] | 0 | 0.00 |
| 24 | FLAG_WORK_PHONE | int64 | 2 | [0, 1] | 0 | 0.00 |
| 25 | FLAG_CONT_MOBILE | int64 | 2 | [1, 0] | 0 | 0.00 |
| 26 | FLAG_PHONE | int64 | 2 | [1, 0] | 0 | 0.00 |
| 27 | FLAG_EMAIL | int64 | 2 | [0, 1] | 0 | 0.00 |
| 28 | OCCUPATION_TYPE | object | 18 | [Laborers, Core staff, Accountants, Managers, nan, Drivers, Sales staff, Cleaning staff, Cooking staff, Private service staff, Medicine staff, Security staff, High skill tech staff, Waiters/barmen staff, Low-skill Laborers, Realty agents, Secretaries, IT staff, HR staff] | 96391 | 31.35 |
| 29 | CNT_FAM_MEMBERS | float64 | 17 | > 5 unique numeric values | 2 | 0.00 |
| 30 | REGION_RATING_CLIENT | int64 | 3 | [2, 1, 3] | 0 | 0.00 |
| 31 | REGION_RATING_CLIENT_W_CITY | int64 | 3 | [2, 1, 3] | 0 | 0.00 |
| 32 | WEEKDAY_APPR_PROCESS_START | object | 7 | [WEDNESDAY, MONDAY, THURSDAY, SUNDAY, SATURDAY, FRIDAY, TUESDAY] | 0 | 0.00 |
| 33 | HOUR_APPR_PROCESS_START | int64 | 24 | > 5 unique numeric values | 0 | 0.00 |
| 34 | REG_REGION_NOT_LIVE_REGION | int64 | 2 | [0, 1] | 0 | 0.00 |
| 35 | REG_REGION_NOT_WORK_REGION | int64 | 2 | [0, 1] | 0 | 0.00 |
| 36 | LIVE_REGION_NOT_WORK_REGION | int64 | 2 | [0, 1] | 0 | 0.00 |
| 37 | REG_CITY_NOT_LIVE_CITY | int64 | 2 | [0, 1] | 0 | 0.00 |
| 38 | REG_CITY_NOT_WORK_CITY | int64 | 2 | [0, 1] | 0 | 0.00 |
| 39 | LIVE_CITY_NOT_WORK_CITY | int64 | 2 | [0, 1] | 0 | 0.00 |
| 40 | ORGANIZATION_TYPE | object | 58 | [Business Entity Type 3, School, Government, Religion, Other, XNA, Electricity, Medicine, Business Entity Type 2, Self-employed, Transport: type 2, Construction, Housing, Kindergarten, Trade: type 7, Industry: type 11, Military, Services, Security Ministries, Transport: type 4, Industry: type 1, Emergency, Security, Trade: type 2, University, Transport: type 3, Police, Business Entity Type 1, Postal, Industry: type 4, Agriculture, Restaurant, Culture, Hotel, Industry: type 7, Trade: type 3, Industry: type 3, Bank, Industry: type 9, Insurance, Trade: type 6, Industry: type 2, Transport: type 1, Industry: type 12, Mobile, Trade: type 1, Industry: type 5, Industry: type 10, Legal Services, Advertising, Trade: type 5, Cleaning, Industry: type 13, Trade: type 4, Telecom, Industry: type 8, Realtor, Industry: type 6] | 0 | 0.00 |
| 41 | EXT_SOURCE_1 | float64 | 114584 | > 5 unique numeric values | 173378 | 56.38 |
| 42 | EXT_SOURCE_2 | float64 | 119831 | > 5 unique numeric values | 660 | 0.21 |
| 43 | EXT_SOURCE_3 | float64 | 814 | > 5 unique numeric values | 60965 | 19.83 |
| 44 | APARTMENTS_AVG | float64 | 2339 | > 5 unique numeric values | 156061 | 50.75 |
| 45 | BASEMENTAREA_AVG | float64 | 3780 | > 5 unique numeric values | 179943 | 58.52 |
| 46 | YEARS_BEGINEXPLUATATION_AVG | float64 | 285 | > 5 unique numeric values | 150007 | 48.78 |
| 47 | YEARS_BUILD_AVG | float64 | 149 | > 5 unique numeric values | 204488 | 66.50 |
| 48 | COMMONAREA_AVG | float64 | 3181 | > 5 unique numeric values | 214865 | 69.87 |
| 49 | ELEVATORS_AVG | float64 | 257 | > 5 unique numeric values | 163891 | 53.30 |
| 50 | ENTRANCES_AVG | float64 | 285 | > 5 unique numeric values | 154828 | 50.35 |
| 51 | FLOORSMAX_AVG | float64 | 403 | > 5 unique numeric values | 153020 | 49.76 |
| 52 | FLOORSMIN_AVG | float64 | 305 | > 5 unique numeric values | 208642 | 67.85 |
| 53 | LANDAREA_AVG | float64 | 3527 | > 5 unique numeric values | 182590 | 59.38 |
| 54 | LIVINGAPARTMENTS_AVG | float64 | 1868 | > 5 unique numeric values | 210199 | 68.35 |
| 55 | LIVINGAREA_AVG | float64 | 5199 | > 5 unique numeric values | 154350 | 50.19 |
| 56 | NONLIVINGAPARTMENTS_AVG | float64 | 386 | > 5 unique numeric values | 213514 | 69.43 |
| 57 | NONLIVINGAREA_AVG | float64 | 3290 | > 5 unique numeric values | 169682 | 55.18 |
| 58 | APARTMENTS_MODE | float64 | 760 | > 5 unique numeric values | 156061 | 50.75 |
| 59 | BASEMENTAREA_MODE | float64 | 3841 | > 5 unique numeric values | 179943 | 58.52 |
| 60 | YEARS_BEGINEXPLUATATION_MODE | float64 | 221 | > 5 unique numeric values | 150007 | 48.78 |
| 61 | YEARS_BUILD_MODE | float64 | 154 | > 5 unique numeric values | 204488 | 66.50 |
| 62 | COMMONAREA_MODE | float64 | 3128 | > 5 unique numeric values | 214865 | 69.87 |
| 63 | ELEVATORS_MODE | float64 | 26 | > 5 unique numeric values | 163891 | 53.30 |
| 64 | ENTRANCES_MODE | float64 | 30 | > 5 unique numeric values | 154828 | 50.35 |
| 65 | FLOORSMAX_MODE | float64 | 25 | > 5 unique numeric values | 153020 | 49.76 |
| 66 | FLOORSMIN_MODE | float64 | 25 | > 5 unique numeric values | 208642 | 67.85 |
| 67 | LANDAREA_MODE | float64 | 3563 | > 5 unique numeric values | 182590 | 59.38 |
| 68 | LIVINGAPARTMENTS_MODE | float64 | 736 | > 5 unique numeric values | 210199 | 68.35 |
| 69 | LIVINGAREA_MODE | float64 | 5301 | > 5 unique numeric values | 154350 | 50.19 |
| 70 | NONLIVINGAPARTMENTS_MODE | float64 | 167 | > 5 unique numeric values | 213514 | 69.43 |
| 71 | NONLIVINGAREA_MODE | float64 | 3327 | > 5 unique numeric values | 169682 | 55.18 |
| 72 | APARTMENTS_MEDI | float64 | 1148 | > 5 unique numeric values | 156061 | 50.75 |
| 73 | BASEMENTAREA_MEDI | float64 | 3772 | > 5 unique numeric values | 179943 | 58.52 |
| 74 | YEARS_BEGINEXPLUATATION_MEDI | float64 | 245 | > 5 unique numeric values | 150007 | 48.78 |
| 75 | YEARS_BUILD_MEDI | float64 | 151 | > 5 unique numeric values | 204488 | 66.50 |
| 76 | COMMONAREA_MEDI | float64 | 3202 | > 5 unique numeric values | 214865 | 69.87 |
| 77 | ELEVATORS_MEDI | float64 | 46 | > 5 unique numeric values | 163891 | 53.30 |
| 78 | ENTRANCES_MEDI | float64 | 46 | > 5 unique numeric values | 154828 | 50.35 |
| 79 | FLOORSMAX_MEDI | float64 | 49 | > 5 unique numeric values | 153020 | 49.76 |
| 80 | FLOORSMIN_MEDI | float64 | 47 | > 5 unique numeric values | 208642 | 67.85 |
| 81 | LANDAREA_MEDI | float64 | 3560 | > 5 unique numeric values | 182590 | 59.38 |
| 82 | LIVINGAPARTMENTS_MEDI | float64 | 1097 | > 5 unique numeric values | 210199 | 68.35 |
| 83 | LIVINGAREA_MEDI | float64 | 5281 | > 5 unique numeric values | 154350 | 50.19 |
| 84 | NONLIVINGAPARTMENTS_MEDI | float64 | 214 | > 5 unique numeric values | 213514 | 69.43 |
| 85 | NONLIVINGAREA_MEDI | float64 | 3323 | > 5 unique numeric values | 169682 | 55.18 |
| 86 | FONDKAPREMONT_MODE | object | 4 | [reg oper account, nan, org spec account, reg oper spec account, not specified] | 210295 | 68.39 |
| 87 | HOUSETYPE_MODE | object | 3 | [block of flats, nan, terraced house, specific housing] | 154297 | 50.18 |
| 88 | TOTALAREA_MODE | float64 | 5116 | > 5 unique numeric values | 148431 | 48.27 |
| 89 | WALLSMATERIAL_MODE | object | 7 | [Stone, brick, Block, nan, Panel, Mixed, Wooden, Others, Monolithic] | 156341 | 50.84 |
| 90 | EMERGENCYSTATE_MODE | object | 2 | [No, nan, Yes] | 145755 | 47.40 |
| 91 | OBS_30_CNT_SOCIAL_CIRCLE | float64 | 33 | > 5 unique numeric values | 1021 | 0.33 |
| 92 | DEF_30_CNT_SOCIAL_CIRCLE | float64 | 10 | > 5 unique numeric values | 1021 | 0.33 |
| 93 | OBS_60_CNT_SOCIAL_CIRCLE | float64 | 33 | > 5 unique numeric values | 1021 | 0.33 |
| 94 | DEF_60_CNT_SOCIAL_CIRCLE | float64 | 9 | > 5 unique numeric values | 1021 | 0.33 |
| 95 | DAYS_LAST_PHONE_CHANGE | float64 | 3773 | > 5 unique numeric values | 1 | 0.00 |
| 96 | FLAG_DOCUMENT_2 | int64 | 2 | [0, 1] | 0 | 0.00 |
| 97 | FLAG_DOCUMENT_3 | int64 | 2 | [1, 0] | 0 | 0.00 |
| 98 | FLAG_DOCUMENT_4 | int64 | 2 | [0, 1] | 0 | 0.00 |
| 99 | FLAG_DOCUMENT_5 | int64 | 2 | [0, 1] | 0 | 0.00 |
| 100 | FLAG_DOCUMENT_6 | int64 | 2 | [0, 1] | 0 | 0.00 |
| 101 | FLAG_DOCUMENT_7 | int64 | 2 | [0, 1] | 0 | 0.00 |
| 102 | FLAG_DOCUMENT_8 | int64 | 2 | [0, 1] | 0 | 0.00 |
| 103 | FLAG_DOCUMENT_9 | int64 | 2 | [0, 1] | 0 | 0.00 |
| 104 | FLAG_DOCUMENT_10 | int64 | 2 | [0, 1] | 0 | 0.00 |
| 105 | FLAG_DOCUMENT_11 | int64 | 2 | [0, 1] | 0 | 0.00 |
| 106 | FLAG_DOCUMENT_12 | int64 | 2 | [0, 1] | 0 | 0.00 |
| 107 | FLAG_DOCUMENT_13 | int64 | 2 | [0, 1] | 0 | 0.00 |
| 108 | FLAG_DOCUMENT_14 | int64 | 2 | [0, 1] | 0 | 0.00 |
| 109 | FLAG_DOCUMENT_15 | int64 | 2 | [0, 1] | 0 | 0.00 |
| 110 | FLAG_DOCUMENT_16 | int64 | 2 | [0, 1] | 0 | 0.00 |
| 111 | FLAG_DOCUMENT_17 | int64 | 2 | [0, 1] | 0 | 0.00 |
| 112 | FLAG_DOCUMENT_18 | int64 | 2 | [0, 1] | 0 | 0.00 |
| 113 | FLAG_DOCUMENT_19 | int64 | 2 | [0, 1] | 0 | 0.00 |
| 114 | FLAG_DOCUMENT_20 | int64 | 2 | [0, 1] | 0 | 0.00 |
| 115 | FLAG_DOCUMENT_21 | int64 | 2 | [0, 1] | 0 | 0.00 |
| 116 | AMT_REQ_CREDIT_BUREAU_HOUR | float64 | 5 | [0.0, nan, 1.0, 2.0, 3.0, 4.0] | 41519 | 13.50 |
| 117 | AMT_REQ_CREDIT_BUREAU_DAY | float64 | 9 | > 5 unique numeric values | 41519 | 13.50 |
| 118 | AMT_REQ_CREDIT_BUREAU_WEEK | float64 | 9 | > 5 unique numeric values | 41519 | 13.50 |
| 119 | AMT_REQ_CREDIT_BUREAU_MON | float64 | 24 | > 5 unique numeric values | 41519 | 13.50 |
| 120 | AMT_REQ_CREDIT_BUREAU_QRT | float64 | 11 | > 5 unique numeric values | 41519 | 13.50 |
| 121 | AMT_REQ_CREDIT_BUREAU_YEAR | float64 | 25 | > 5 unique numeric values | 41519 | 13.50 |
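The numeric part of the summary above can also be computed without an explicit loop, because `dtypes`, `nunique()`, and `isnull()` all work column-wise on the whole DataFrame. The sketch below is an alternative under assumptions: the helper name `summarize_columns` and the toy frame `demo` are hypothetical, and the list of unique values per column is left out. It also makes one detail of the table visible: `nunique()` does not count NaN, while `unique()` lists it, which is why a column such as NAME_TYPE_SUITE reports 7 unique values but shows `nan` among the listed ones.

```python
import numpy as np
import pandas as pd


def summarize_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column dtype, unique count, and missing-value statistics."""
    return (
        pd.DataFrame({
            "Data Type": df.dtypes,
            "Unique Values Count": df.nunique(),  # NaN is not counted here
            "Null Values Count": df.isnull().sum(),
            "Null Values %": df.isnull().mean() * 100,
        })
        .rename_axis("Variable")
        .reset_index()
    )


# Toy frame for illustration only.
demo = pd.DataFrame({
    "a": [1, 2, np.nan, 4],
    "b": ["x", None, "x", "y"],
    "c": [0, 1, 0, 1],
})
summary = summarize_columns(demo)
print(summary)
```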
# Summary statistics for the numeric variables
application.describe()
| | SK_ID_CURR | TARGET | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | AMT_GOODS_PRICE | REGION_POPULATION_RELATIVE | DAYS_BIRTH | DAYS_EMPLOYED | DAYS_REGISTRATION | DAYS_ID_PUBLISH | OWN_CAR_AGE | FLAG_MOBIL | FLAG_EMP_PHONE | FLAG_WORK_PHONE | FLAG_CONT_MOBILE | FLAG_PHONE | FLAG_EMAIL | CNT_FAM_MEMBERS | REGION_RATING_CLIENT | REGION_RATING_CLIENT_W_CITY | HOUR_APPR_PROCESS_START | REG_REGION_NOT_LIVE_REGION | REG_REGION_NOT_WORK_REGION | LIVE_REGION_NOT_WORK_REGION | REG_CITY_NOT_LIVE_CITY | REG_CITY_NOT_WORK_CITY | LIVE_CITY_NOT_WORK_CITY | EXT_SOURCE_1 | EXT_SOURCE_2 | EXT_SOURCE_3 | APARTMENTS_AVG | BASEMENTAREA_AVG | YEARS_BEGINEXPLUATATION_AVG | YEARS_BUILD_AVG | COMMONAREA_AVG | ELEVATORS_AVG | ENTRANCES_AVG | FLOORSMAX_AVG | FLOORSMIN_AVG | LANDAREA_AVG | LIVINGAPARTMENTS_AVG | LIVINGAREA_AVG | NONLIVINGAPARTMENTS_AVG | NONLIVINGAREA_AVG | APARTMENTS_MODE | BASEMENTAREA_MODE | YEARS_BEGINEXPLUATATION_MODE | YEARS_BUILD_MODE | COMMONAREA_MODE | ELEVATORS_MODE | ENTRANCES_MODE | FLOORSMAX_MODE | FLOORSMIN_MODE | LANDAREA_MODE | LIVINGAPARTMENTS_MODE | LIVINGAREA_MODE | NONLIVINGAPARTMENTS_MODE | NONLIVINGAREA_MODE | APARTMENTS_MEDI | BASEMENTAREA_MEDI | YEARS_BEGINEXPLUATATION_MEDI | YEARS_BUILD_MEDI | COMMONAREA_MEDI | ELEVATORS_MEDI | ENTRANCES_MEDI | FLOORSMAX_MEDI | FLOORSMIN_MEDI | LANDAREA_MEDI | LIVINGAPARTMENTS_MEDI | LIVINGAREA_MEDI | NONLIVINGAPARTMENTS_MEDI | NONLIVINGAREA_MEDI | TOTALAREA_MODE | OBS_30_CNT_SOCIAL_CIRCLE | DEF_30_CNT_SOCIAL_CIRCLE | OBS_60_CNT_SOCIAL_CIRCLE | DEF_60_CNT_SOCIAL_CIRCLE | DAYS_LAST_PHONE_CHANGE | FLAG_DOCUMENT_2 | FLAG_DOCUMENT_3 | FLAG_DOCUMENT_4 | FLAG_DOCUMENT_5 | FLAG_DOCUMENT_6 | FLAG_DOCUMENT_7 | FLAG_DOCUMENT_8 | FLAG_DOCUMENT_9 | FLAG_DOCUMENT_10 | FLAG_DOCUMENT_11 | FLAG_DOCUMENT_12 | FLAG_DOCUMENT_13 | FLAG_DOCUMENT_14 | FLAG_DOCUMENT_15 | FLAG_DOCUMENT_16 | FLAG_DOCUMENT_17 | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 307499.00 | 307233.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 104582.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 307509.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 134133.00 | 306851.00 | 246546.00 | 151450.00 | 127568.00 | 157504.00 | 103023.00 | 92646.00 | 143620.00 | 152683.00 | 154491.00 | 98869.00 | 124921.00 | 97312.00 | 153161.00 | 93997.00 | 137829.00 | 151450.00 | 127568.00 | 157504.00 | 103023.00 | 92646.00 | 143620.00 | 152683.00 | 154491.00 | 98869.00 | 124921.00 | 97312.00 | 153161.00 | 93997.00 | 137829.00 | 151450.00 | 127568.00 | 157504.00 | 103023.00 | 92646.00 | 143620.00 | 152683.00 | 154491.00 | 98869.00 | 124921.00 | 97312.00 | 153161.00 | 93997.00 | 137829.00 | 159080.00 | 306490.00 | 306490.00 | 306490.00 | 306490.00 | 307510.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 307511.00 | 265992.00 | 265992.00 | 265992.00 | 265992.00 | 265992.00 | 265992.00 |
| mean | 278180.52 | 0.08 | 0.42 | 168797.92 | 599026.00 | 27108.57 | 538396.21 | 0.02 | -16037.00 | 63815.05 | -4986.12 | -2994.20 | 12.06 | 1.00 | 0.82 | 0.20 | 1.00 | 0.28 | 0.06 | 2.15 | 2.05 | 2.03 | 12.06 | 0.02 | 0.05 | 0.04 | 0.08 | 0.23 | 0.18 | 0.50 | 0.51 | 0.51 | 0.12 | 0.09 | 0.98 | 0.75 | 0.04 | 0.08 | 0.15 | 0.23 | 0.23 | 0.07 | 0.10 | 0.11 | 0.01 | 0.03 | 0.11 | 0.09 | 0.98 | 0.76 | 0.04 | 0.07 | 0.15 | 0.22 | 0.23 | 0.06 | 0.11 | 0.11 | 0.01 | 0.03 | 0.12 | 0.09 | 0.98 | 0.76 | 0.04 | 0.08 | 0.15 | 0.23 | 0.23 | 0.07 | 0.10 | 0.11 | 0.01 | 0.03 | 0.10 | 1.42 | 0.14 | 1.41 | 0.10 | -962.86 | 0.00 | 0.71 | 0.00 | 0.02 | 0.09 | 0.00 | 0.08 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.01 | 0.00 | 0.01 | 0.00 | 0.00 | 0.00 | 0.01 | 0.01 | 0.03 | 0.27 | 0.27 | 1.90 |
| std | 102790.18 | 0.27 | 0.72 | 237123.15 | 402490.78 | 14493.74 | 369446.46 | 0.01 | 4363.99 | 141275.77 | 3522.89 | 1509.45 | 11.94 | 0.00 | 0.38 | 0.40 | 0.04 | 0.45 | 0.23 | 0.91 | 0.51 | 0.50 | 3.27 | 0.12 | 0.22 | 0.20 | 0.27 | 0.42 | 0.38 | 0.21 | 0.19 | 0.19 | 0.11 | 0.08 | 0.06 | 0.11 | 0.08 | 0.13 | 0.10 | 0.14 | 0.16 | 0.08 | 0.09 | 0.11 | 0.05 | 0.07 | 0.11 | 0.08 | 0.06 | 0.11 | 0.07 | 0.13 | 0.10 | 0.14 | 0.16 | 0.08 | 0.10 | 0.11 | 0.05 | 0.07 | 0.11 | 0.08 | 0.06 | 0.11 | 0.08 | 0.13 | 0.10 | 0.15 | 0.16 | 0.08 | 0.09 | 0.11 | 0.05 | 0.07 | 0.11 | 2.40 | 0.45 | 2.38 | 0.36 | 826.81 | 0.01 | 0.45 | 0.01 | 0.12 | 0.28 | 0.01 | 0.27 | 0.06 | 0.00 | 0.06 | 0.00 | 0.06 | 0.05 | 0.03 | 0.10 | 0.02 | 0.09 | 0.02 | 0.02 | 0.02 | 0.08 | 0.11 | 0.20 | 0.92 | 0.79 | 1.87 |
| min | 100002.00 | 0.00 | 0.00 | 25650.00 | 45000.00 | 1615.50 | 40500.00 | 0.00 | -25229.00 | -17912.00 | -24672.00 | -7197.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | 1.00 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.01 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | -4292.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| 25% | 189145.50 | 0.00 | 0.00 | 112500.00 | 270000.00 | 16524.00 | 238500.00 | 0.01 | -19682.00 | -2760.00 | -7479.50 | -4299.00 | 5.00 | 1.00 | 1.00 | 0.00 | 1.00 | 0.00 | 0.00 | 2.00 | 2.00 | 2.00 | 10.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.33 | 0.39 | 0.37 | 0.06 | 0.04 | 0.98 | 0.69 | 0.01 | 0.00 | 0.07 | 0.17 | 0.08 | 0.02 | 0.05 | 0.05 | 0.00 | 0.00 | 0.05 | 0.04 | 0.98 | 0.70 | 0.01 | 0.00 | 0.07 | 0.17 | 0.08 | 0.02 | 0.05 | 0.04 | 0.00 | 0.00 | 0.06 | 0.04 | 0.98 | 0.69 | 0.01 | 0.00 | 0.07 | 0.17 | 0.08 | 0.02 | 0.05 | 0.05 | 0.00 | 0.00 | 0.04 | 0.00 | 0.00 | 0.00 | 0.00 | -1570.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| 50% | 278202.00 | 0.00 | 0.00 | 147150.00 | 513531.00 | 24903.00 | 450000.00 | 0.02 | -15750.00 | -1213.00 | -4504.00 | -3254.00 | 9.00 | 1.00 | 1.00 | 0.00 | 1.00 | 0.00 | 0.00 | 2.00 | 2.00 | 2.00 | 12.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.51 | 0.57 | 0.54 | 0.09 | 0.08 | 0.98 | 0.76 | 0.02 | 0.00 | 0.14 | 0.17 | 0.21 | 0.05 | 0.08 | 0.07 | 0.00 | 0.00 | 0.08 | 0.07 | 0.98 | 0.76 | 0.02 | 0.00 | 0.14 | 0.17 | 0.21 | 0.05 | 0.08 | 0.07 | 0.00 | 0.00 | 0.09 | 0.08 | 0.98 | 0.76 | 0.02 | 0.00 | 0.14 | 0.17 | 0.21 | 0.05 | 0.08 | 0.07 | 0.00 | 0.00 | 0.07 | 0.00 | 0.00 | 0.00 | 0.00 | -757.00 | 0.00 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 |
| 75% | 367142.50 | 0.00 | 1.00 | 202500.00 | 808650.00 | 34596.00 | 679500.00 | 0.03 | -12413.00 | -289.00 | -2010.00 | -1720.00 | 15.00 | 1.00 | 1.00 | 0.00 | 1.00 | 1.00 | 0.00 | 3.00 | 2.00 | 2.00 | 14.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.68 | 0.66 | 0.67 | 0.15 | 0.11 | 0.99 | 0.82 | 0.05 | 0.12 | 0.21 | 0.33 | 0.38 | 0.09 | 0.12 | 0.13 | 0.00 | 0.03 | 0.14 | 0.11 | 0.99 | 0.82 | 0.05 | 0.12 | 0.21 | 0.33 | 0.38 | 0.08 | 0.13 | 0.13 | 0.00 | 0.02 | 0.15 | 0.11 | 0.99 | 0.83 | 0.05 | 0.12 | 0.21 | 0.33 | 0.38 | 0.09 | 0.12 | 0.13 | 0.00 | 0.03 | 0.13 | 2.00 | 0.00 | 2.00 | 0.00 | -274.00 | 0.00 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 3.00 |
| max | 456255.00 | 1.00 | 19.00 | 117000000.00 | 4050000.00 | 258025.50 | 4050000.00 | 0.07 | -7489.00 | 365243.00 | 0.00 | 0.00 | 91.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 20.00 | 3.00 | 3.00 | 23.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.96 | 0.85 | 0.90 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 348.00 | 34.00 | 344.00 | 24.00 | 0.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 4.00 | 9.00 | 8.00 | 27.00 | 261.00 | 25.00 |
Handling missing data¶
In credit scoring, missing data can be handled in many ways. As a general rule, variables with a very low share of missing values (at most 5%) will be imputed with the median or the mean. Variables with roughly 5-20% missing values are left as they are, because the missing values can later form a separate category during discretization, and a share of that size may carry valuable information. Variables with more than 30-40% missing values will generally be removed from the dataset, because they are not practical to use. For categorical variables, missing values are assigned a separate category.
- After reviewing the information carried by each variable with missing values in more than 40% of cases, it was decided to remove these variables at this point.
- None of these variables was found to be particularly important from a business point of view.
- In addition, the missing values in the removed variables do not in themselves carry any valuable information.
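The thresholds described above can be turned into a small helper that proposes an action for each column. This is only a sketch on toy data: the function name `bucket_by_missing_share`, the toy columns and the exact cut-offs (5% and 40%) are assumptions made for illustration, and the range between 20% and 40%, which the text does not assign to any rule, is treated here as "keep as missing".

```python
import numpy as np
import pandas as pd

def bucket_by_missing_share(df, low=0.05, high=0.40):
    """Propose a handling action per column based on its share of missing values.

    Hypothetical helper: the cut-offs are assumptions, not fixed rules.
    """
    share = df.isna().mean()  # fraction of missing values per column
    action = pd.Series("keep_as_missing", index=df.columns)
    action[share <= low] = "impute"   # very few gaps: median/mean imputation
    action[share > high] = "drop"     # too many gaps: remove the column
    return pd.DataFrame({"missing_share": share, "action": action})

toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],           # 0% missing
    "b": [1.0, np.nan, 3.0, 4.0, 5.0],        # 20% missing
    "c": [np.nan, np.nan, np.nan, 4.0, 5.0],  # 60% missing
})
summary = bucket_by_missing_share(toy)
print(summary)
```

Such a table makes it easier to review the proposed action for every column before any data is actually modified.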
Removing unnecessary variables¶
application = application.loc[:,(application.isnull().sum() / len(application)) <= 0.4]
Missing data in columns with less than 1% missing values¶
# Compute the percentage of missing values for each column
percent_missing = application.isna().mean() * 100
# Select the columns with less than 1% missing values
columns_with_less_than_1_percent_na = percent_missing[percent_missing < 1].index
# Fill the missing values in these columns with the median
for col in columns_with_less_than_1_percent_na:
    if application[col].dtype in ['float64', 'int64']:  # Make sure the column is numeric
        # Assign the result back instead of using inplace=True on a column selection
        application[col] = application[col].fillna(application[col].median())
Missing data in categorical columns¶
categorical_columns = application.select_dtypes(include=['object', 'category']).columns
# Fill the missing values with a separate category ('Brak Danych' is Polish for "no data")
application[categorical_columns] = application[categorical_columns].fillna('Brak Danych')
Missing data in the EXT_SOURCE_3 column: examining the variable before deciding on imputation¶
plt.figure(figsize=(8, 4))
sns.boxplot(x='TARGET', y='EXT_SOURCE_3', data=application)
plt.title('Distribution of EXT_SOURCE_3 by TARGET')
plt.xlabel('TARGET')
plt.ylabel('EXT_SOURCE_3')
plt.show()
* The variable appears to separate defaults from non-defaults fairly well
* For that reason I keep it in the dataset and do not impute it
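One way to check whether the missing values themselves carry information is to compare the default rate of rows with and without a value. The sketch below uses invented toy data, and the flag column `EXT_SOURCE_3_MISSING` is a hypothetical name introduced only for this illustration.

```python
import numpy as np
import pandas as pd

# Toy data (invented for illustration, not taken from the real dataset)
toy = pd.DataFrame({
    "TARGET":       [0, 0, 1, 0, 1, 1, 0, 0],
    "EXT_SOURCE_3": [0.7, 0.6, 0.2, np.nan, np.nan, 0.3, 0.8, np.nan],
})

# 1 where the value is missing, 0 where it is present
toy["EXT_SOURCE_3_MISSING"] = toy["EXT_SOURCE_3"].isna().astype(int)

# Share of defaults (TARGET == 1) in each group
default_rate = toy.groupby("EXT_SOURCE_3_MISSING")["TARGET"].mean()
print(default_rate)
```

If the two rates differ noticeably on the real data, leaving the gaps as their own category during discretization preserves that signal.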
Missing data in the remaining columns of the original application table¶
# columns: AMT_REQ_CREDIT_BUREAU_HOUR, AMT_REQ_CREDIT_BUREAU_DAY, AMT_REQ_CREDIT_BUREAU_WEEK, AMT_REQ_CREDIT_BUREAU_MON, AMT_REQ_CREDIT_BUREAU_QRT, AMT_REQ_CREDIT_BUREAU_YEAR
columns_to_analyze = ['AMT_REQ_CREDIT_BUREAU_HOUR', 'AMT_REQ_CREDIT_BUREAU_DAY', 'AMT_REQ_CREDIT_BUREAU_WEEK',
                      'AMT_REQ_CREDIT_BUREAU_MON', 'AMT_REQ_CREDIT_BUREAU_QRT', 'AMT_REQ_CREDIT_BUREAU_YEAR']
# Box plots for each of the columns
fig, axes = plt.subplots(nrows=3, ncols=2, figsize=(15, 10))
axes = axes.flatten()  # Flatten the array of axes to make them easier to iterate over
for i, col in enumerate(columns_to_analyze):
    sns.boxplot(x='TARGET', y=col, data=application, ax=axes[i])
    axes[i].set_title(f'Boxplot of {col} by TARGET')
    axes[i].set_xlabel('TARGET')
    axes[i].set_ylabel(col)
plt.tight_layout()
plt.show()
* For each of these variables, the distribution is concentrated at values close to zero, with individual outliers
* The only variable that does not consist solely of outliers is AMT_REQ_CREDIT_BUREAU_YEAR, the number of enquiries to the credit bureau (BIK) about the client in the year before the application
* For that reason I remove the other 5 variables from the dataset and keep the yearly variable with its missing values, as with EXT_SOURCE_3
columns_to_drop = ['AMT_REQ_CREDIT_BUREAU_HOUR', 'AMT_REQ_CREDIT_BUREAU_DAY', 'AMT_REQ_CREDIT_BUREAU_WEEK', 'AMT_REQ_CREDIT_BUREAU_MON', 'AMT_REQ_CREDIT_BUREAU_QRT']
# Drop the selected columns from the DataFrame
application = application.drop(columns=columns_to_drop)
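The visual impression from the box plots (a box collapsed at zero with only outliers above it) can also be checked numerically. The following is a minimal sketch on invented toy data; the column names `REQ_HOUR` and `REQ_YEAR` are hypothetical stand-ins for the real enquiry counters.

```python
import pandas as pd

# Toy enquiry counts (invented): one column is almost all zeros, the other is spread out
toy = pd.DataFrame({
    "REQ_HOUR": [0, 0, 0, 0, 0, 0, 0, 0, 0, 3],
    "REQ_YEAR": [0, 1, 2, 1, 3, 0, 2, 4, 1, 2],
})

profile = pd.DataFrame({
    "share_of_zeros": (toy == 0).mean(),  # fraction of rows equal to zero
    "median": toy.median(),
    "q75": toy.quantile(0.75),            # upper edge of the box in a box plot
})
print(profile)
```

A column whose 75th percentile is still zero has a box plot made up only of outliers, which is the pattern described for the five removed variables.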
Checking the table after handling missing data¶
application.shape  # 68 columns (variables) remain
results = []
total_rows = len(application)  # Number of rows in the table
# Iterate over every column
for column in application:
    unique_values = application[column].nunique()  # Number of unique values in the column
    data_type = application[column].dtype  # Data type
    null_count = application[column].isnull().sum()  # Number of nulls
    null_percent = (null_count / total_rows) * 100  # Percentage of nulls
    # Conditions that decide which unique values are displayed
    # With at most 5 unique values, show them regardless of the data type
    if unique_values <= 5:
        values_to_display = application[column].unique()
    # For non-numeric columns, show all unique values even if there are more than 5
    elif application[column].dtype == 'object':
        values_to_display = application[column].unique()
    # For numeric columns with more than 5 unique values, show the message below
    else:
        values_to_display = "> 5 unikatowych wartości liczbowych"  # Polish for "> 5 unique numeric values"
    # Append the results to the list
    results.append({
        'Variable': column,
        'Data Type': data_type,
        'Unique Values Count': unique_values,
        'Unique Values': values_to_display,
        'Null Values Count': null_count,
        'Null Values %': null_percent
    })
# Convert the results to a DataFrame
results_df = pd.DataFrame(results)
results_df
After the initial selection of variables based on missing data, 68 columns remain. Two of them still contain missing values.
The "bureau_balance" table¶
Checking the basic characteristics of the dataset¶
bureau_balance.shape
results = []
total_rows = len(bureau_balance)  # Number of rows in the table
# Iterate over every column
for column in bureau_balance:
    unique_values = bureau_balance[column].nunique()  # Number of unique values in the column
    data_type = bureau_balance[column].dtype  # Data type
    null_count = bureau_balance[column].isnull().sum()  # Number of nulls
    null_percent = (null_count / total_rows) * 100  # Percentage of nulls
    # Conditions that decide which unique values are displayed
    # With at most 5 unique values, show them regardless of the data type
    if unique_values <= 5:
        values_to_display = bureau_balance[column].unique()
    # For non-numeric columns, show all unique values even if there are more than 5
    elif bureau_balance[column].dtype == 'object':
        values_to_display = bureau_balance[column].unique()
    # For numeric columns with more than 5 unique values, show the message below
    else:
        values_to_display = "> 5 unikatowych wartości liczbowych"  # Polish for "> 5 unique numeric values"
    # Append the results to the list
    results.append({
        'Variable': column,
        'Data Type': data_type,
        'Unique Values Count': unique_values,
        'Unique Values': values_to_display,
        'Null Values Count': null_count,
        'Null Values %': null_percent
    })
# Convert the results to a DataFrame
results_df = pd.DataFrame(results)
results_df
# Statistics for numeric variables
bureau_balance.describe()
Handling missing data¶
- There are no missing values in the source table "bureau_balance"
Feature Engineering¶
The bureau_balance table contains loan statuses by balance month. To join it with the "bureau" table, the data must be aggregated to the SK_ID_BUREAU level. The tables could also be joined horizontally (without aggregation), but that makes little sense, because the number of rows would grow considerably and the result would contain a large number of empty values in its columns. Aggregation is the best choice here.
# The bureau_balance dataset has 3 variables: the id, the balance month and the status in that month
# From a scoring point of view the status matters most, so I decided to create variables that count the occurrences of each status
# In addition, the earliest and the latest balance month relative to the application date are reported
# Aggregate MONTHS_BALANCE to find the earliest and the latest month for each SK_ID_BUREAU
agg_bureau_balance = bureau_balance.groupby('SK_ID_BUREAU').agg(
    EARLIEST_MONTHS_BALANCE=('MONTHS_BALANCE', 'min'),
    LATEST_MONTHS_BALANCE=('MONTHS_BALANCE', 'max')
).reset_index()
# Count the occurrences of each status
status_counts = bureau_balance.pivot_table(index='SK_ID_BUREAU', columns='STATUS', aggfunc='size', fill_value=0)
# Prefix the names of the resulting status variables with 'STATUS_'
status_counts.columns = ['STATUS_' + str(col) for col in status_counts.columns]
# Join the aggregates with the status counts
bureau_balance_agg = pd.merge(agg_bureau_balance, status_counts, on='SK_ID_BUREAU', how='left')
# Prefix every column with BUREAU_BALANCE_ so that it is clear these columns come from the bureau_balance table
bureau_balance_agg.columns = ['BUREAU_BALANCE_' + col if col != 'SK_ID_BUREAU' else col for col in bureau_balance_agg.columns]  # keep the original name of the primary key
bureau_balance_agg.head(3)
A table prepared this way can later be joined with the "bureau" dataset.
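To make the effect of this aggregation easier to follow, here is a small sketch on invented toy data with the same three columns. It uses `pd.crosstab` for the status counts, which is swapped in here as a compact alternative to `pivot_table(aggfunc='size')`; the toy values are assumptions made for illustration.

```python
import pandas as pd

# Toy monthly records for two bureau credits (invented values)
toy = pd.DataFrame({
    "SK_ID_BUREAU":   [1, 1, 1, 2, 2],
    "MONTHS_BALANCE": [0, -1, -2, 0, -1],
    "STATUS":         ["C", "0", "0", "X", "1"],
})

# Earliest and latest balance month per credit
months = toy.groupby("SK_ID_BUREAU").agg(
    EARLIEST_MONTHS_BALANCE=("MONTHS_BALANCE", "min"),
    LATEST_MONTHS_BALANCE=("MONTHS_BALANCE", "max"),
).reset_index()

# Number of months spent in each status, one column per status
counts = (
    pd.crosstab(toy["SK_ID_BUREAU"], toy["STATUS"])
    .add_prefix("STATUS_")
    .reset_index()
)

# One row per SK_ID_BUREAU instead of one row per month
result = months.merge(counts, on="SK_ID_BUREAU", how="left")
print(result)
```

Five monthly rows collapse into two rows, one per credit, which is the shape needed for a one-to-one join with "bureau".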
The "bureau" table¶
Checking the basic characteristics of the dataset¶
bureau.shape
results = []
total_rows = len(bureau)  # Number of rows in the table
# Iterate over every column
for column in bureau:
    unique_values = bureau[column].nunique()  # Number of unique values in the column
    data_type = bureau[column].dtype  # Data type
    null_count = bureau[column].isnull().sum()  # Number of nulls
    null_percent = (null_count / total_rows) * 100  # Percentage of nulls
    # Conditions that decide which unique values are displayed
    # With at most 5 unique values, show them regardless of the data type
    if unique_values <= 5:
        values_to_display = bureau[column].unique()
    # For non-numeric columns, show all unique values even if there are more than 5
    elif bureau[column].dtype == 'object':
        values_to_display = bureau[column].unique()
    # For numeric columns with more than 5 unique values, show the message below
    else:
        values_to_display = "> 5 unikatowych wartości liczbowych"  # Polish for "> 5 unique numeric values"
    # Append the results to the list
    results.append({
        'Variable': column,
        'Data Type': data_type,
        'Unique Values Count': unique_values,
        'Unique Values': values_to_display,
        'Null Values Count': null_count,
        'Null Values %': null_percent
    })
# Convert the results to a DataFrame
results_df = pd.DataFrame(results)
results_df
# Statistics for numeric variables
bureau.describe()
Handling missing data¶
Removing unnecessary variables¶
bureau = bureau.loc[:,(bureau.isnull().sum() / len(bureau)) <= 0.4]
Checking the table after handling missing data¶
bureau.shape
results = []
total_rows = len(bureau)  # Number of rows in the table
# Iterate over every column
for column in bureau:
    unique_values = bureau[column].nunique()  # Number of unique values in the column
    data_type = bureau[column].dtype  # Data type
    null_count = bureau[column].isnull().sum()  # Number of nulls
    null_percent = (null_count / total_rows) * 100  # Percentage of nulls
    # Conditions that decide which unique values are displayed
    # With at most 5 unique values, show them regardless of the data type
    if unique_values <= 5:
        values_to_display = bureau[column].unique()
    # For non-numeric columns, show all unique values even if there are more than 5
    elif bureau[column].dtype == 'object':
        values_to_display = bureau[column].unique()
    # For numeric columns with more than 5 unique values, show the message below
    else:
        values_to_display = "> 5 unikatowych wartości liczbowych"  # Polish for "> 5 unique numeric values"
    # Append the results to the list
    results.append({
        'Variable': column,
        'Data Type': data_type,
        'Unique Values Count': unique_values,
        'Unique Values': values_to_display,
        'Null Values Count': null_count,
        'Null Values %': null_percent
    })
# Convert the results to a DataFrame
results_df = pd.DataFrame(results)
results_df
Feature engineering for this table is carried out after the data from bureau_balance has been joined to it (see below).
Joining the "bureau_balance" and "bureau" tables¶
# Before the join, add a prefix to every variable of the bureau table so that it is clear it comes from bureau
bureau.columns = ['BUREAU_' + col if col not in ('SK_ID_BUREAU', 'SK_ID_CURR') else col for col in bureau.columns]  # keep the original names of the key columns
df_bik = pd.merge(bureau, bureau_balance_agg, on='SK_ID_BUREAU', how='left')
df_bik.head()
After joining the two datasets, the aggregated columns often contain missing values. They could be filled in at this point, but it makes little sense: here they refer to individual credit bureau (BIK) entries, and the credit bureau data will in any case be aggregated later to the level of the applicant, i.e. SK_ID_CURR.
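Where these missing values come from can be seen on a toy example: a left join keeps every bureau record and leaves NaN wherever no monthly balance exists for it. The sketch below uses invented data; the `indicator=True` option of `merge` is used only to make the unmatched rows visible and is not part of the pipeline above.

```python
import pandas as pd

# Toy bureau records: credits 11 and 13 have no entry in the balance table
bureau_toy = pd.DataFrame({
    "SK_ID_BUREAU": [10, 11, 12, 13],
    "SK_ID_CURR":   [1, 1, 2, 3],
})
balance_toy = pd.DataFrame({
    "SK_ID_BUREAU": [10, 12],
    "BUREAU_BALANCE_STATUS_0": [3, 5],
})

# Left join: all bureau rows are kept, unmatched ones get NaN
merged = bureau_toy.merge(balance_toy, on="SK_ID_BUREAU", how="left", indicator=True)
print(merged)

# Aggregating to the applicant level: sum() skips NaN,
# so an applicant with no balance data at all ends up with 0
per_client = merged.groupby("SK_ID_CURR")["BUREAU_BALANCE_STATUS_0"].sum().reset_index()
print(per_client)
```

Note that after a `sum` aggregation a client without any balance history is indistinguishable from a client whose count is genuinely zero; whether that matters depends on how the feature is used later.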
The "df_bik" table¶
Feature Engineering¶
# Aggregation of numeric data
agg_functions = {
    'BUREAU_DAYS_CREDIT': ['mean', 'max', 'min'],  # How many days before the current application the client applied for the credit reported to the credit bureau
    'BUREAU_CREDIT_DAY_OVERDUE': ['mean', 'sum', 'max'],  # Number of days past due on the credit at the time of the application for the related loan
    'BUREAU_DAYS_CREDIT_ENDDATE': ['mean', 'max', 'min'],  # Remaining duration of the credit
    'BUREAU_DAYS_ENDDATE_FACT': ['mean', 'max', 'min'],  # Days from the end of the credit to the application, closed credits only (hence the many missing values, but I keep it)
    'BUREAU_CNT_CREDIT_PROLONG': ['mean', 'sum'],  # How many times the credit was prolonged
    'BUREAU_AMT_CREDIT_SUM': ['mean', 'sum'],  # Current credit amount in the credit bureau
    'BUREAU_AMT_CREDIT_SUM_DEBT': ['mean', 'sum'],  # Current debt on the credit in the credit bureau
    'BUREAU_AMT_CREDIT_SUM_LIMIT': ['mean', 'sum'],  # Current credit limit reported in the credit bureau
    'BUREAU_AMT_CREDIT_SUM_OVERDUE': ['mean', 'sum'],  # Current overdue amount in the credit bureau
    'BUREAU_DAYS_CREDIT_UPDATE': ['max', 'min']  # How many days before the application the last information about the credit appeared in the credit bureau
}
# Aggregate the data from df_bik
df_bik_agg = df_bik.groupby('SK_ID_CURR').agg(agg_functions)
df_bik_agg.columns = ['_'.join(col).upper() for col in df_bik_agg.columns.values]  # flatten the column names
df_bik_agg.reset_index(inplace=True)
# Aggregate CREDIT_ACTIVE by its mode, i.e. the most frequent status
most_frequent_credit_active = df_bik.groupby('SK_ID_CURR')['BUREAU_CREDIT_ACTIVE'].agg(
    lambda x: x.mode()[0] if not x.mode().empty else 'UNKNOWN').reset_index(name='BUREAU_MOST_FREQ_CREDIT_ACTIVE')
# Aggregate CREDIT_CURRENCY by its mode as well, i.e. the most frequent credit currency in the credit bureau
most_frequent_credit_currency = df_bik.groupby('SK_ID_CURR')['BUREAU_CREDIT_CURRENCY'].agg(
    lambda x: x.mode()[0] if not x.mode().empty else 'UNKNOWN').reset_index(name='BUREAU_MOST_FREQ_CREDIT_CURRENCY')
# Aggregate CREDIT_TYPE by its mode as well, i.e. the most frequent credit type in the credit bureau
most_frequent_credit_type = df_bik.groupby('SK_ID_CURR')['BUREAU_CREDIT_TYPE'].agg(
    lambda x: x.mode()[0] if not x.mode().empty else 'UNKNOWN').reset_index(name='BUREAU_MOST_FREQ_CREDIT_TYPE')
# Earliest and latest MONTHS_BALANCE
earliest_months_balance = df_bik.groupby('SK_ID_CURR')['BUREAU_BALANCE_EARLIEST_MONTHS_BALANCE'].min().reset_index(name='BUREAU_BALANCE_EARLIEST_MONTHS_BALANCE')
latest_months_balance = df_bik.groupby('SK_ID_CURR')['BUREAU_BALANCE_LATEST_MONTHS_BALANCE'].max().reset_index(name='BUREAU_BALANCE_LATEST_MONTHS_BALANCE')
# Aggregate the balance status columns that were previously aggregated from bureau_balance into bureau
status_columns = ['BUREAU_BALANCE_STATUS_0', 'BUREAU_BALANCE_STATUS_1', 'BUREAU_BALANCE_STATUS_2', 'BUREAU_BALANCE_STATUS_3', 'BUREAU_BALANCE_STATUS_4', 'BUREAU_BALANCE_STATUS_5', 'BUREAU_BALANCE_STATUS_C', 'BUREAU_BALANCE_STATUS_X']
status_sums = df_bik.groupby('SK_ID_CURR')[status_columns].sum().reset_index()
# Join all aggregated results into df_bik_final
df_bik_final = df_bik_agg.merge(most_frequent_credit_active, on='SK_ID_CURR', how='left')\
    .merge(most_frequent_credit_currency, on='SK_ID_CURR', how='left')\
    .merge(most_frequent_credit_type, on='SK_ID_CURR', how='left')\
    .merge(earliest_months_balance, on='SK_ID_CURR', how='left')\
    .merge(latest_months_balance, on='SK_ID_CURR', how='left')\
    .merge(status_sums, on='SK_ID_CURR', how='left')
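The mode-based aggregation used for the categorical columns can be illustrated on a few invented rows. The helper `most_frequent` below is a hypothetical name; it follows the same idea as the lambdas above but computes the mode only once per group.

```python
import numpy as np
import pandas as pd

def most_frequent(values, fallback="UNKNOWN"):
    """Return the most frequent value of a group, or a fallback if there is none."""
    modes = values.mode()  # NaN is ignored by default; ties are returned in sorted order
    return modes.iloc[0] if not modes.empty else fallback

# Toy data (invented): client 2 has only a missing status, client 3 has a tie
toy = pd.DataFrame({
    "SK_ID_CURR": [1, 1, 1, 2, 3, 3],
    "BUREAU_CREDIT_ACTIVE": ["Closed", "Active", "Closed", np.nan, "Active", "Closed"],
})

result = (
    toy.groupby("SK_ID_CURR")["BUREAU_CREDIT_ACTIVE"]
    .agg(most_frequent)
    .reset_index(name="BUREAU_MOST_FREQ_CREDIT_ACTIVE")
)
print(result)
```

The tie for client 3 is resolved alphabetically, because `mode()` returns all tied values in sorted order and only the first one is taken. This is worth keeping in mind when interpreting the "most frequent" feature.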
Checking the data after aggregation¶
df_bik_final.shape
results = []
total_rows = len(df_bik_final)  # Number of rows in the table
# Iterate over every column
for column in df_bik_final:
    unique_values = df_bik_final[column].nunique()  # Number of unique values in the column
    data_type = df_bik_final[column].dtype  # Data type
    null_count = df_bik_final[column].isnull().sum()  # Number of nulls
    null_percent = (null_count / total_rows) * 100  # Percentage of nulls
    # Conditions that decide which unique values are displayed
    # With at most 5 unique values, show them regardless of the data type
    if unique_values <= 5:
        values_to_display = df_bik_final[column].unique()
    # For non-numeric columns, show all unique values even if there are more than 5
    elif df_bik_final[column].dtype == 'object':
        values_to_display = df_bik_final[column].unique()
    # For numeric columns with more than 5 unique values, show the message below
    else:
        values_to_display = "> 5 unikatowych wartości liczbowych"  # Polish for "> 5 unique numeric values"
    # Append the results to the list
    results.append({
        'Variable': column,
        'Data Type': data_type,
        'Unique Values Count': unique_values,
        'Unique Values': values_to_display,
        'Null Values Count': null_count,
        'Null Values %': null_percent
    })
# Convert the results to a DataFrame
results_df = pd.DataFrame(results)
results_df
Handling missing data¶
# I drop the columns with more than 40% missing values, i.e. the two newly added variables EARLIEST_MONTHS_BALANCE and LATEST_MONTHS_BALANCE
df_bik_final = df_bik_final.loc[:,(df_bik_final.isnull().sum() / len(df_bik_final)) <= 0.4]
# For the columns listed below the share of nulls is below 1%, so I impute them with the median
columns_to_fill = ['BUREAU_DAYS_CREDIT_ENDDATE_MEAN', 'BUREAU_DAYS_CREDIT_ENDDATE_MAX', 'BUREAU_DAYS_CREDIT_ENDDATE_MIN', 'BUREAU_AMT_CREDIT_SUM_MEAN', 'BUREAU_AMT_CREDIT_SUM_DEBT_MEAN']
for column in columns_to_fill:
    median_value = df_bik_final[column].median()
    # Assign the result back instead of using inplace=True on a column selection
    df_bik_final[column] = df_bik_final[column].fillna(median_value)
# I also drop the variables with more than 5% missing values (such variables cannot stay, because I will later left-join this table to the application table and missing values must not be present at this point; they may appear after the join, where they carry the information that the client has no record in this table)
df_bik_final = df_bik_final.loc[:,(df_bik_final.isnull().sum() / len(df_bik_final)) <= 0.05]
Checking the aggregated table after the changes¶
# the df_bik_final table ultimately has 32 columns and 305811 rows
df_bik_final.shape
results = []
total_rows = len(df_bik_final)  # Number of rows in the table
# Iterate over every column
for column in df_bik_final:
    unique_values = df_bik_final[column].nunique()  # Number of unique values in the column
    data_type = df_bik_final[column].dtype  # Data type
    null_count = df_bik_final[column].isnull().sum()  # Number of nulls
    null_percent = (null_count / total_rows) * 100  # Percentage of nulls
    # Conditions that decide which unique values are displayed
    # With at most 5 unique values, show them regardless of the data type
    if unique_values <= 5:
        values_to_display = df_bik_final[column].unique()
    # For non-numeric columns, show all unique values even if there are more than 5
    elif df_bik_final[column].dtype == 'object':
        values_to_display = df_bik_final[column].unique()
    # For numeric columns with more than 5 unique values, show the message below
    else:
        values_to_display = "> 5 unikatowych wartości liczbowych"  # Polish for "> 5 unique numeric values"
    # Append the results to the list
    results.append({
        'Variable': column,
        'Data Type': data_type,
        'Unique Values Count': unique_values,
        'Unique Values': values_to_display,
        'Null Values Count': null_count,
        'Null Values %': null_percent
    })
# Convert the results to a DataFrame
results_df = pd.DataFrame(results)
results_df
This table is ready to be joined with the "application" table at a later stage.
The "previous_application" table¶
Checking the basic characteristics of the dataset¶
previous_application.shape
results = []
total_rows = len(previous_application)  # Number of rows in the table
# Iterate over every column
for column in previous_application:
    unique_values = previous_application[column].nunique()  # Number of unique values in the column
    data_type = previous_application[column].dtype  # Data type
    null_count = previous_application[column].isnull().sum()  # Number of nulls
    null_percent = (null_count / total_rows) * 100  # Percentage of nulls
    # Conditions that decide which unique values are displayed
    # With at most 5 unique values, show them regardless of the data type
    if unique_values <= 5:
        values_to_display = previous_application[column].unique()
    # For non-numeric columns, show all unique values even if there are more than 5
    elif previous_application[column].dtype == 'object':
        values_to_display = previous_application[column].unique()
    # For numeric columns with more than 5 unique values, show the message below
    else:
        values_to_display = "> 5 unikatowych wartości liczbowych"  # Polish for "> 5 unique numeric values"
    # Append the results to the list
    results.append({
        'Variable': column,
        'Data Type': data_type,
        'Unique Values Count': unique_values,
        'Unique Values': values_to_display,
        'Null Values Count': null_count,
        'Null Values %': null_percent
    })
# Convert the results to a DataFrame
results_df = pd.DataFrame(results)
results_df
# Statistics for numeric variables
previous_application.describe()
Handling missing data¶
Removing unnecessary variables¶
previous_application = previous_application.loc[:,(previous_application.isnull().sum() / len(previous_application)) <= 0.4]
previous_application.shape
Five variables with missing values remain. I will not handle them at this stage, because the data from this table will shortly be aggregated to the SK_ID_CURR level, i.e. the level of the application table, so the missing values may not be a problem, and these variables seem important enough that their values should not be altered.
Feature Engineering¶
# Before joining with the installments_payments table, add a prefix to every variable
previous_application.columns = ['PREVIOUS_APPLICATION_' + col if col not in ('SK_ID_PREV', 'SK_ID_CURR') else col for col in previous_application.columns]  # keep the original names of the key columns
# New variable: ratio of AMT_APPLICATION to AMT_CREDIT (the requested amount relative to the amount granted; a high ratio may indicate weaker clients?)
previous_application['PREVIOUS_APPLICATION_RATIO_APP_TO_GET'] = previous_application['PREVIOUS_APPLICATION_AMT_APPLICATION'] / previous_application['PREVIOUS_APPLICATION_AMT_CREDIT']
# Division by zero yields inf (or NaN for 0/0); replace both with 0, assigning the result back instead of using inplace=True
previous_application['PREVIOUS_APPLICATION_RATIO_APP_TO_GET'] = (
    previous_application['PREVIOUS_APPLICATION_RATIO_APP_TO_GET']
    .replace([np.inf, -np.inf], np.nan)
    .fillna(0)
)
agg_functions = {
    'PREVIOUS_APPLICATION_AMT_ANNUITY': ['mean', 'max', 'min', 'sum'],
    'PREVIOUS_APPLICATION_AMT_APPLICATION': ['mean', 'max', 'min', 'sum'],  # How much the client asked for
    'PREVIOUS_APPLICATION_AMT_CREDIT': ['mean', 'max', 'min', 'sum'],  # How much the client was granted
    'PREVIOUS_APPLICATION_AMT_GOODS_PRICE': ['mean', 'max', 'min', 'sum'],  # Price of the goods the client applied for
    'PREVIOUS_APPLICATION_HOUR_APPR_PROCESS_START': ['mean', 'min', 'max'],  # Hour at which the process started
    'PREVIOUS_APPLICATION_DAYS_DECISION': ['mean', 'max', 'min', 'sum'],  # How long ago the decision was made
    'PREVIOUS_APPLICATION_RATIO_APP_TO_GET': ['mean', 'max', 'min']
}
# Aggregate the data
df_previous_application_agg = previous_application.groupby('SK_ID_CURR').agg(agg_functions)
df_previous_application_agg.columns = ['_'.join(col).upper() for col in df_previous_application_agg.columns.values]
# Reset the index
df_previous_application_agg.reset_index(inplace=True)
# New variable: number of previous applications for each client
# The counts are mapped by SK_ID_CURR; transform('size') would return one value per
# row of previous_application and misalign with the aggregated rows
prev_apps_count = previous_application.groupby('SK_ID_CURR')['SK_ID_PREV'].size()
df_previous_application_agg['PREVIOUS_APPLICATION_PREV_APPS_COUNT'] = df_previous_application_agg['SK_ID_CURR'].map(prev_apps_count)
# Create dummy variables: e.g. if a colour variable has 3 variants, I create 3 binary (0/1) variables indicating whether a given record had that colour; each record has exactly one 1 across the three variables (useful for variables with few variants)
categorical_columns_dummy = [
    'PREVIOUS_APPLICATION_NAME_CONTRACT_TYPE',  # How many contracts of each type the client had
    'PREVIOUS_APPLICATION_WEEKDAY_APPR_PROCESS_START',  # How many applications the client made on each day of the week
    'PREVIOUS_APPLICATION_NAME_CONTRACT_STATUS',  # How many contracts with each status the client had
    'PREVIOUS_APPLICATION_NAME_CLIENT_TYPE',  # How many times the client was a new or a returning client
    'PREVIOUS_APPLICATION_CODE_REJECT_REASON',  # Why the previous application was rejected
    'PREVIOUS_APPLICATION_NAME_PRODUCT_TYPE',  # How many applications of each product type
    'PREVIOUS_APPLICATION_NAME_YIELD_GROUP'
]
# Create the dummy variables and aggregate them per client
# (df_previous_application_agg already exists at this point, so each result is joined to it directly)
for cat_col in categorical_columns_dummy:
    dummies = pd.get_dummies(previous_application[cat_col], prefix=cat_col)
    dummies_agg = dummies.groupby(previous_application['SK_ID_CURR']).agg('sum')
    df_previous_application_agg = df_previous_application_agg.join(dummies_agg, on='SK_ID_CURR', how='left')
# Aggregate the most frequent values
most_frequent_features = [
    'PREVIOUS_APPLICATION_NAME_CASH_LOAN_PURPOSE',
    'PREVIOUS_APPLICATION_NAME_PAYMENT_TYPE',
    'PREVIOUS_APPLICATION_NAME_GOODS_CATEGORY',
    'PREVIOUS_APPLICATION_CHANNEL_TYPE',
    'PREVIOUS_APPLICATION_PRODUCT_COMBINATION'
]
for feature in most_frequent_features:
    mode_df = previous_application.groupby('SK_ID_CURR')[feature].agg(
        lambda x: x.mode()[0] if not x.mode().empty else 'UNKNOWN'
    ).reset_index(name=f'PREVIOUS_APPLICATION_MOST_FREQ_{feature}')
    df_previous_application_agg = df_previous_application_agg.merge(mode_df, on='SK_ID_CURR', how='left')
# Convert the column names to upper case
df_previous_application_agg.columns = [col.upper() for col in df_previous_application_agg.columns]
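The dummy-variable aggregation and the per-client application count can be illustrated on a few invented rows. This is a sketch only: the toy values and the shortened column names (`NAME_CONTRACT_STATUS`, `PREV_APPS_COUNT`) are assumptions chosen to keep the example readable.

```python
import pandas as pd

# Toy previous applications (invented): client 1 has three, client 2 has one
toy = pd.DataFrame({
    "SK_ID_CURR": [1, 1, 1, 2],
    "SK_ID_PREV": [100, 101, 102, 103],
    "NAME_CONTRACT_STATUS": ["Approved", "Refused", "Approved", "Canceled"],
})

# One 0/1 column per status, then summed per client to obtain counts
dummies = pd.get_dummies(toy["NAME_CONTRACT_STATUS"], prefix="NAME_CONTRACT_STATUS")
status_counts = dummies.groupby(toy["SK_ID_CURR"]).sum()

# Number of previous applications per client, indexed by SK_ID_CURR
app_counts = toy.groupby("SK_ID_CURR")["SK_ID_PREV"].size().rename("PREV_APPS_COUNT")

# Both results share the SK_ID_CURR index, so they align row by row
per_client = status_counts.join(app_counts).reset_index()
print(per_client)
```

Because both intermediate results are indexed by SK_ID_CURR, joining them cannot attach a count to the wrong client, which is the main thing to verify whenever row-level and client-level tables are combined.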
Checking the aggregated previous_application table after the changes¶
results = []
total_rows = len(df_previous_application_agg)  # Number of rows in the table
# Iterate over every column
for column in df_previous_application_agg:
    unique_values = df_previous_application_agg[column].nunique()  # Number of unique values in the column
    data_type = df_previous_application_agg[column].dtype  # Data type
    null_count = df_previous_application_agg[column].isnull().sum()  # Number of nulls
    null_percent = (null_count / total_rows) * 100  # Percentage of nulls
    # Conditions that decide which unique values are displayed
    # With at most 5 unique values, show them regardless of the data type
    if unique_values <= 5:
        values_to_display = df_previous_application_agg[column].unique()
    # For non-numeric columns, show all unique values even if there are more than 5
    elif df_previous_application_agg[column].dtype == 'object':
        values_to_display = df_previous_application_agg[column].unique()
    # For numeric columns with more than 5 unique values, show the message below
    else:
        values_to_display = "> 5 unikatowych wartości liczbowych"  # Polish for "> 5 unique numeric values"
    # Append the results to the list
    results.append({
        'Variable': column,
        'Data Type': data_type,
        'Unique Values Count': unique_values,
        'Unique Values': values_to_display,
        'Null Values Count': null_count,
        'Null Values %': null_percent
    })
# Convert the results to a DataFrame
results_df = pd.DataFrame(results)
results_df
| | Variable | Data Type | Unique Values Count | Unique Values | Null Values Count | Null Values % |
|---|---|---|---|---|---|---|
| 0 | SK_ID_CURR | int64 | 338857 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 1 | PREVIOUS_APPLICATION_AMT_ANNUITY_MEAN | float64 | 311139 | > 5 unikatowych wartości liczbowych | 480 | 0.14 |
| 2 | PREVIOUS_APPLICATION_AMT_ANNUITY_MAX | float64 | 164390 | > 5 unikatowych wartości liczbowych | 480 | 0.14 |
| 3 | PREVIOUS_APPLICATION_AMT_ANNUITY_MIN | float64 | 159918 | > 5 unikatowych wartości liczbowych | 480 | 0.14 |
| 4 | PREVIOUS_APPLICATION_AMT_ANNUITY_SUM | float64 | 310963 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 5 | PREVIOUS_APPLICATION_AMT_APPLICATION_MEAN | float64 | 218595 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 6 | PREVIOUS_APPLICATION_AMT_APPLICATION_MAX | float64 | 53054 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 7 | PREVIOUS_APPLICATION_AMT_APPLICATION_MIN | float64 | 39315 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 8 | PREVIOUS_APPLICATION_AMT_APPLICATION_SUM | float64 | 198222 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 9 | PREVIOUS_APPLICATION_AMT_CREDIT_MEAN | float64 | 239733 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 10 | PREVIOUS_APPLICATION_AMT_CREDIT_MAX | float64 | 62833 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 11 | PREVIOUS_APPLICATION_AMT_CREDIT_MIN | float64 | 40983 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 12 | PREVIOUS_APPLICATION_AMT_CREDIT_SUM | float64 | 218294 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 13 | PREVIOUS_APPLICATION_AMT_GOODS_PRICE_MEAN | float64 | 211422 | > 5 unikatowych wartości liczbowych | 1064 | 0.31 |
| 14 | PREVIOUS_APPLICATION_AMT_GOODS_PRICE_MAX | float64 | 53050 | > 5 unikatowych wartości liczbowych | 1064 | 0.31 |
| 15 | PREVIOUS_APPLICATION_AMT_GOODS_PRICE_MIN | float64 | 50839 | > 5 unikatowych wartości liczbowych | 1064 | 0.31 |
| 16 | PREVIOUS_APPLICATION_AMT_GOODS_PRICE_SUM | float64 | 198234 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 17 | PREVIOUS_APPLICATION_HOUR_APPR_PROCESS_START_MEAN | float64 | 2761 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 18 | PREVIOUS_APPLICATION_HOUR_APPR_PROCESS_START_MIN | int64 | 24 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 19 | PREVIOUS_APPLICATION_HOUR_APPR_PROCESS_START_MAX | int64 | 24 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 20 | PREVIOUS_APPLICATION_DAYS_DECISION_MEAN | float64 | 65447 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 21 | PREVIOUS_APPLICATION_DAYS_DECISION_MAX | int64 | 2922 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 22 | PREVIOUS_APPLICATION_DAYS_DECISION_MIN | int64 | 2921 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 23 | PREVIOUS_APPLICATION_DAYS_DECISION_SUM | int64 | 21073 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 24 | PREVIOUS_APPLICATION_RATIO_APP_TO_GET_MEAN | float64 | 290826 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 25 | PREVIOUS_APPLICATION_RATIO_APP_TO_GET_MAX | float64 | 159487 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 26 | PREVIOUS_APPLICATION_RATIO_APP_TO_GET_MIN | float64 | 82444 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 27 | PREVIOUS_APPLICATION_PREV_APPS_COUNT | int64 | 67 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 28 | PREVIOUS_APPLICATION_NAME_CONTRACT_TYPE_CASH LOANS | int64 | 60 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 29 | PREVIOUS_APPLICATION_NAME_CONTRACT_TYPE_CONSUMER LOANS | int64 | 37 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 30 | PREVIOUS_APPLICATION_NAME_CONTRACT_TYPE_REVOLVING LOANS | int64 | 28 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 31 | PREVIOUS_APPLICATION_NAME_CONTRACT_TYPE_XNA | int64 | 4 | [0, 2, 1, 3] | 0 | 0.00 |
| 32 | PREVIOUS_APPLICATION_WEEKDAY_APPR_PROCESS_START_FRIDAY | int64 | 21 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 33 | PREVIOUS_APPLICATION_WEEKDAY_APPR_PROCESS_START_MONDAY | int64 | 25 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 34 | PREVIOUS_APPLICATION_WEEKDAY_APPR_PROCESS_START_SATURDAY | int64 | 20 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 35 | PREVIOUS_APPLICATION_WEEKDAY_APPR_PROCESS_START_SUNDAY | int64 | 22 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 36 | PREVIOUS_APPLICATION_WEEKDAY_APPR_PROCESS_START_THURSDAY | int64 | 23 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 37 | PREVIOUS_APPLICATION_WEEKDAY_APPR_PROCESS_START_TUESDAY | int64 | 21 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 38 | PREVIOUS_APPLICATION_WEEKDAY_APPR_PROCESS_START_WEDNESDAY | int64 | 22 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 39 | PREVIOUS_APPLICATION_NAME_CONTRACT_STATUS_APPROVED | int64 | 26 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 40 | PREVIOUS_APPLICATION_NAME_CONTRACT_STATUS_CANCELED | int64 | 40 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 41 | PREVIOUS_APPLICATION_NAME_CONTRACT_STATUS_REFUSED | int64 | 47 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 42 | PREVIOUS_APPLICATION_NAME_CONTRACT_STATUS_UNUSED OFFER | int64 | 13 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 43 | PREVIOUS_APPLICATION_NAME_CLIENT_TYPE_NEW | int64 | 20 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 44 | PREVIOUS_APPLICATION_NAME_CLIENT_TYPE_REFRESHED | int64 | 25 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 45 | PREVIOUS_APPLICATION_NAME_CLIENT_TYPE_REPEATER | int64 | 66 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 46 | PREVIOUS_APPLICATION_NAME_CLIENT_TYPE_XNA | int64 | 10 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 47 | PREVIOUS_APPLICATION_CODE_REJECT_REASON_CLIENT | int64 | 13 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 48 | PREVIOUS_APPLICATION_CODE_REJECT_REASON_HC | int64 | 38 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 49 | PREVIOUS_APPLICATION_CODE_REJECT_REASON_LIMIT | int64 | 23 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 50 | PREVIOUS_APPLICATION_CODE_REJECT_REASON_SCO | int64 | 21 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 51 | PREVIOUS_APPLICATION_CODE_REJECT_REASON_SCOFR | int64 | 19 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 52 | PREVIOUS_APPLICATION_CODE_REJECT_REASON_SYSTEM | int64 | 11 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 53 | PREVIOUS_APPLICATION_CODE_REJECT_REASON_VERIF | int64 | 9 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 54 | PREVIOUS_APPLICATION_CODE_REJECT_REASON_XAP | int64 | 48 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 55 | PREVIOUS_APPLICATION_CODE_REJECT_REASON_XNA | int64 | 12 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 56 | PREVIOUS_APPLICATION_NAME_PRODUCT_TYPE_XNA | int64 | 51 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 57 | PREVIOUS_APPLICATION_NAME_PRODUCT_TYPE_WALK-IN | int64 | 34 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 58 | PREVIOUS_APPLICATION_NAME_PRODUCT_TYPE_X-SELL | int64 | 35 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 59 | PREVIOUS_APPLICATION_NAME_YIELD_GROUP_XNA | int64 | 49 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 60 | PREVIOUS_APPLICATION_NAME_YIELD_GROUP_HIGH | int64 | 31 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 61 | PREVIOUS_APPLICATION_NAME_YIELD_GROUP_LOW_ACTION | int64 | 24 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 62 | PREVIOUS_APPLICATION_NAME_YIELD_GROUP_LOW_NORMAL | int64 | 26 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 63 | PREVIOUS_APPLICATION_NAME_YIELD_GROUP_MIDDLE | int64 | 29 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 64 | PREVIOUS_APPLICATION_MOST_FREQ_PREVIOUS_APPLICATION_NAME_CASH_LOAN_PURPOSE_X | object | 25 | [XAP, XNA, Other, Repairs, Buying a used car, Buying a holiday home / land, Car repairs, Medicine, Building a house or an annex, Everyday expenses, Buying a new car, Urgent needs, Furniture, Journey, Education, Buying a home, Wedding / gift / holiday, Business development, Payments on other loans, Purchase of electronic equipment, Gasification / water supply, Hobby, Buying a garage, Money for a third person, Refusal to name the goal] | 0 | 0.00 |
| 65 | PREVIOUS_APPLICATION_MOST_FREQ_PREVIOUS_APPLICATION_NAME_PAYMENT_TYPE_X | object | 4 | [Cash through the bank, XNA, Non-cash from your account, Cashless from the account of the employer] | 0 | 0.00 |
| 66 | PREVIOUS_APPLICATION_MOST_FREQ_PREVIOUS_APPLICATION_NAME_GOODS_CATEGORY_X | object | 26 | [Mobile, Vehicles, Consumer Electronics, XNA, Audio/Video, Furniture, Computers, Construction Materials, Clothing and Accessories, Auto Accessories, Homewares, Photo / Cinema Equipment, Gardening, Office Appliances, Medicine, Jewelry, Weapon, Tourism, Fitness, Medical Supplies, Sport and Leisure, Other, Direct Sales, Education, Insurance, Additional Service] | 0 | 0.00 |
| 67 | PREVIOUS_APPLICATION_MOST_FREQ_PREVIOUS_APPLICATION_CHANNEL_TYPE_X | object | 8 | [Country-wide, Stone, Regional / Local, Credit and cash offices, AP+ (Cash loan), Contact center, Channel of corporate sales, Car dealer] | 0 | 0.00 |
| 68 | PREVIOUS_APPLICATION_MOST_FREQ_PREVIOUS_APPLICATION_NAME_CASH_LOAN_PURPOSE_Y | object | 25 | [XAP, XNA, Other, Repairs, Buying a used car, Buying a holiday home / land, Car repairs, Medicine, Building a house or an annex, Everyday expenses, Buying a new car, Urgent needs, Furniture, Journey, Education, Buying a home, Wedding / gift / holiday, Business development, Payments on other loans, Purchase of electronic equipment, Gasification / water supply, Hobby, Buying a garage, Money for a third person, Refusal to name the goal] | 0 | 0.00 |
| 69 | PREVIOUS_APPLICATION_MOST_FREQ_PREVIOUS_APPLICATION_NAME_PAYMENT_TYPE_Y | object | 4 | [Cash through the bank, XNA, Non-cash from your account, Cashless from the account of the employer] | 0 | 0.00 |
| 70 | PREVIOUS_APPLICATION_MOST_FREQ_PREVIOUS_APPLICATION_NAME_GOODS_CATEGORY_Y | object | 26 | [Mobile, Vehicles, Consumer Electronics, XNA, Audio/Video, Furniture, Computers, Construction Materials, Clothing and Accessories, Auto Accessories, Homewares, Photo / Cinema Equipment, Gardening, Office Appliances, Medicine, Jewelry, Weapon, Tourism, Fitness, Medical Supplies, Sport and Leisure, Other, Direct Sales, Education, Insurance, Additional Service] | 0 | 0.00 |
| 71 | PREVIOUS_APPLICATION_MOST_FREQ_PREVIOUS_APPLICATION_CHANNEL_TYPE_Y | object | 8 | [Country-wide, Stone, Regional / Local, Credit and cash offices, AP+ (Cash loan), Contact center, Channel of corporate sales, Car dealer] | 0 | 0.00 |
| 72 | PREVIOUS_APPLICATION_MOST_FREQ_PREVIOUS_APPLICATION_PRODUCT_COMBINATION | object | 17 | [POS mobile with interest, POS other with interest, Cash X-Sell: low, POS mobile without interest, Cash, Cash X-Sell: middle, POS household with interest, POS industry without interest, Card X-Sell, Cash X-Sell: high, POS household without interest, POS industry with interest, Card Street, Cash Street: high, Cash Street: low, Cash Street: middle, POS others without interest] | 0 | 0.00 |
Handling missing data after aggregation¶
# Replace missing values with the column medians
cols_to_fill = [
    'PREVIOUS_APPLICATION_AMT_ANNUITY_MEAN',
    'PREVIOUS_APPLICATION_AMT_ANNUITY_MAX',
    'PREVIOUS_APPLICATION_AMT_ANNUITY_MIN',
    'PREVIOUS_APPLICATION_AMT_GOODS_PRICE_MEAN',
    'PREVIOUS_APPLICATION_AMT_GOODS_PRICE_MAX',
    'PREVIOUS_APPLICATION_AMT_GOODS_PRICE_MIN',
]
# Assign the result back instead of calling fillna(inplace=True) on a column selection
for col in cols_to_fill:
    df_previous_application_agg[col] = df_previous_application_agg[col].fillna(df_previous_application_agg[col].median())
# Check that no missing values remain: False -> no missing values anywhere
df_previous_application_agg.isnull().any().any()
False
Table "POS_CASH_balance"¶
Checking basic characteristics of the dataset¶
POS_CASH_balance.shape
(10001358, 8)
results = []
total_rows = len(POS_CASH_balance)  # Number of rows in the table
# Iterate over every column
for column in POS_CASH_balance:
    unique_values = POS_CASH_balance[column].nunique()  # Number of unique values in the column
    data_type = POS_CASH_balance[column].dtype  # Data type
    null_count = POS_CASH_balance[column].isnull().sum()  # Number of nulls
    null_percent = (null_count / total_rows) * 100  # % of nulls
    # Rules for displaying the unique values
    # With at most 5 unique values, show them regardless of data type
    if unique_values <= 5:
        values_to_display = POS_CASH_balance[column].unique()
    # For non-numeric columns, show all unique values even when there are more than 5
    elif POS_CASH_balance[column].dtype == 'object':
        values_to_display = POS_CASH_balance[column].unique()
    # For numeric columns with more than 5 unique values, show a placeholder
    # message (Polish for "> 5 unique numeric values")
    else:
        values_to_display = "> 5 unikatowych wartości liczbowych"
    # Append the results to the list
    results.append({
        'Variable': column,
        'Data Type': data_type,
        'Unique Values Count': unique_values,
        'Unique Values': values_to_display,
        'Null Values Count': null_count,
        'Null Values %': null_percent
    })
# Convert the results to a DataFrame
results_df = pd.DataFrame(results)
results_df
| | Variable | Data Type | Unique Values Count | Unique Values | Null Values Count | Null Values % |
|---|---|---|---|---|---|---|
| 0 | SK_ID_PREV | int64 | 936325 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 1 | SK_ID_CURR | int64 | 337252 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 2 | MONTHS_BALANCE | int64 | 96 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 3 | CNT_INSTALMENT | float64 | 73 | > 5 unikatowych wartości liczbowych | 26071 | 0.26 |
| 4 | CNT_INSTALMENT_FUTURE | float64 | 79 | > 5 unikatowych wartości liczbowych | 26087 | 0.26 |
| 5 | NAME_CONTRACT_STATUS | object | 9 | [Active, Completed, Signed, Approved, Returned to the store, Demand, Canceled, XNA, Amortized debt] | 0 | 0.00 |
| 6 | SK_DPD | int64 | 3400 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 7 | SK_DPD_DEF | int64 | 2307 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
# Summary statistics for numeric variables
POS_CASH_balance.describe()
| | SK_ID_PREV | SK_ID_CURR | MONTHS_BALANCE | CNT_INSTALMENT | CNT_INSTALMENT_FUTURE | SK_DPD | SK_DPD_DEF |
|---|---|---|---|---|---|---|---|
| count | 10001358.00 | 10001358.00 | 10001358.00 | 9975287.00 | 9975271.00 | 10001358.00 | 10001358.00 |
| mean | 1903216.60 | 278403.86 | -35.01 | 17.09 | 10.48 | 11.61 | 0.65 |
| std | 535846.53 | 102763.75 | 26.07 | 12.00 | 11.11 | 132.71 | 32.76 |
| min | 1000001.00 | 100001.00 | -96.00 | 1.00 | 0.00 | 0.00 | 0.00 |
| 25% | 1434405.00 | 189550.00 | -54.00 | 10.00 | 3.00 | 0.00 | 0.00 |
| 50% | 1896565.00 | 278654.00 | -28.00 | 12.00 | 7.00 | 0.00 | 0.00 |
| 75% | 2368963.00 | 367429.00 | -13.00 | 24.00 | 14.00 | 0.00 | 0.00 |
| max | 2843499.00 | 456255.00 | -1.00 | 92.00 | 85.00 | 4231.00 | 3595.00 |
* Missing values occur in two columns: CNT_INSTALMENT (number of loan instalments) and CNT_INSTALMENT_FUTURE (number of instalments left to pay), about 0.26% of rows each.
* They are examined only after this table has been aggregated to the SK_ID_CURR level.
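Deferring the check until after aggregation works because pandas group aggregations skip NaN values. The sketch below uses a made-up toy frame (not the real table) to show how `mean`, `sum` and `max` behave when a client has some or only missing values; it is an illustration under those assumptions, not part of the original pipeline.

```python
import numpy as np
import pandas as pd

# Toy stand-in for POS_CASH_balance; the values are invented for illustration
toy = pd.DataFrame({
    "SK_ID_CURR": [1, 1, 2, 2],
    "CNT_INSTALMENT": [12.0, np.nan, np.nan, np.nan],
})

# Client 1 has one known value, client 2 has only missing values
agg = toy.groupby("SK_ID_CURR")["CNT_INSTALMENT"].agg(["mean", "sum", "max"])
print(agg)
```

For client 1 the missing row is simply ignored. For client 2 `mean` and `max` stay NaN, while `sum` returns 0.0, so summed columns never show missing values even when every underlying row was empty.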
Feature Engineering¶
# Before merging, prefix every POS_CASH variable with the table name
POS_CASH_balance.columns = ['POS_CASH_' + col if col not in ('SK_ID_PREV', 'SK_ID_CURR') else col for col in POS_CASH_balance.columns]  # keep the original key column names
# Aggregations for the numeric variables
agg_functions = {
    'POS_CASH_MONTHS_BALANCE': ['min', 'max', 'mean'],  # Earliest, latest and mean balance month
    'POS_CASH_CNT_INSTALMENT': ['mean', 'sum', 'max', 'min'],  # Mean, sum, maximum and minimum number of instalments
    'POS_CASH_CNT_INSTALMENT_FUTURE': ['mean', 'sum', 'max', 'min'],  # Mean, sum, maximum and minimum number of remaining instalments
    'POS_CASH_SK_DPD': ['max', 'mean'],  # Maximum and mean days past due (the minimum makes no business sense)
    'POS_CASH_SK_DPD_DEF': ['max', 'mean']  # Maximum and mean days past due with tolerance applied (the minimum makes no business sense)
}
# Aggregate the data to the SK_ID_CURR level
df_POS_CASH_agg = POS_CASH_balance.groupby('SK_ID_CURR').agg(agg_functions)
df_POS_CASH_agg.columns = ['_'.join(col).upper() for col in df_POS_CASH_agg.columns.values]  # flatten the column names
df_POS_CASH_agg.reset_index(inplace=True)
# Aggregation for NAME_CONTRACT_STATUS: count the occurrences of each status
status_counts = POS_CASH_balance.pivot_table(index='SK_ID_CURR', columns='POS_CASH_NAME_CONTRACT_STATUS', aggfunc='size', fill_value=0)
# Prefix the resulting status variables with 'POS_CASH_NAME_CONTRACT_STATUS_'
status_counts.columns = ['POS_CASH_NAME_CONTRACT_STATUS_' + str(col) for col in status_counts.columns]
# Extra variable: number of distinct POS_CASH loans per client, i.e. how many unique SK_ID_PREV values each SK_ID_CURR has
POS_CASH_APPLICATION_COUNT = POS_CASH_balance.groupby('SK_ID_CURR').agg(POS_CASH_APP_COUNT=('SK_ID_PREV', 'nunique')).reset_index()
# Merge the numeric aggregates with the status counts and the loan count
df_POS_CASH_agg = pd.merge(df_POS_CASH_agg, status_counts, on='SK_ID_CURR', how='left')
df_POS_CASH_agg = pd.merge(df_POS_CASH_agg, POS_CASH_APPLICATION_COUNT, on='SK_ID_CURR', how='left')
df_POS_CASH_agg.columns = [col.upper() for col in df_POS_CASH_agg.columns]
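The status counts and the loan count can be easier to follow on a tiny example. The sketch below applies the same two calls (`pivot_table` with `aggfunc='size'` and `nunique`) to a made-up frame; the column name `STATUS` and all values are assumptions chosen for illustration.

```python
import pandas as pd

# Toy monthly snapshots: client 1 has two loans (10 and 11), client 2 has one
toy = pd.DataFrame({
    "SK_ID_CURR": [1, 1, 1, 2],
    "SK_ID_PREV": [10, 10, 11, 20],
    "STATUS": ["Active", "Active", "Completed", "Active"],
})

# One row per client, one column per status, cells hold the number of snapshots
status_counts = toy.pivot_table(
    index="SK_ID_CURR", columns="STATUS", aggfunc="size", fill_value=0
)

# Number of distinct previous loans per client
app_count = toy.groupby("SK_ID_CURR")["SK_ID_PREV"].nunique()

print(status_counts)
print(app_count)
```

Note that the status columns count monthly snapshots, not loans: a loan that stayed active for seven months contributes seven to the active count, which is why the loan count is computed separately.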
Checking the aggregated POS_CASH table¶
df_POS_CASH_agg.head(1)
| | SK_ID_CURR | POS_CASH_MONTHS_BALANCE_MIN | POS_CASH_MONTHS_BALANCE_MAX | POS_CASH_MONTHS_BALANCE_MEAN | POS_CASH_CNT_INSTALMENT_MEAN | POS_CASH_CNT_INSTALMENT_SUM | POS_CASH_CNT_INSTALMENT_MAX | POS_CASH_CNT_INSTALMENT_MIN | POS_CASH_CNT_INSTALMENT_FUTURE_MEAN | POS_CASH_CNT_INSTALMENT_FUTURE_SUM | POS_CASH_CNT_INSTALMENT_FUTURE_MAX | POS_CASH_CNT_INSTALMENT_FUTURE_MIN | POS_CASH_SK_DPD_MAX | POS_CASH_SK_DPD_MEAN | POS_CASH_SK_DPD_DEF_MAX | POS_CASH_SK_DPD_DEF_MEAN | POS_CASH_NAME_CONTRACT_STATUS_ACTIVE | POS_CASH_NAME_CONTRACT_STATUS_AMORTIZED DEBT | POS_CASH_NAME_CONTRACT_STATUS_APPROVED | POS_CASH_NAME_CONTRACT_STATUS_CANCELED | POS_CASH_NAME_CONTRACT_STATUS_COMPLETED | POS_CASH_NAME_CONTRACT_STATUS_DEMAND | POS_CASH_NAME_CONTRACT_STATUS_RETURNED TO THE STORE | POS_CASH_NAME_CONTRACT_STATUS_SIGNED | POS_CASH_NAME_CONTRACT_STATUS_XNA | POS_CASH_APP_COUNT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100001 | -96 | -53 | -72.56 | 4.00 | 36.00 | 4.00 | 4.00 | 1.44 | 13.00 | 4.00 | 0.00 | 7 | 0.78 | 7 | 0.78 | 7 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 2 |
results = []
total_rows = len(df_POS_CASH_agg)  # Number of rows in the table
# Iterate over every column
for column in df_POS_CASH_agg:
    unique_values = df_POS_CASH_agg[column].nunique()  # Number of unique values in the column
    data_type = df_POS_CASH_agg[column].dtype  # Data type
    null_count = df_POS_CASH_agg[column].isnull().sum()  # Number of nulls
    null_percent = (null_count / total_rows) * 100  # % of nulls
    # Rules for displaying the unique values
    # With at most 5 unique values, show them regardless of data type
    if unique_values <= 5:
        values_to_display = df_POS_CASH_agg[column].unique()
    # For non-numeric columns, show all unique values even when there are more than 5
    elif df_POS_CASH_agg[column].dtype == 'object':
        values_to_display = df_POS_CASH_agg[column].unique()
    # For numeric columns with more than 5 unique values, show a placeholder
    # message (Polish for "> 5 unique numeric values")
    else:
        values_to_display = "> 5 unikatowych wartości liczbowych"
    # Append the results to the list
    results.append({
        'Variable': column,
        'Data Type': data_type,
        'Unique Values Count': unique_values,
        'Unique Values': values_to_display,
        'Null Values Count': null_count,
        'Null Values %': null_percent
    })
# Convert the results to a DataFrame
results_df = pd.DataFrame(results)
results_df
| | Variable | Data Type | Unique Values Count | Unique Values | Null Values Count | Null Values % |
|---|---|---|---|---|---|---|
| 0 | SK_ID_CURR | int64 | 337252 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 1 | POS_CASH_MONTHS_BALANCE_MIN | int64 | 96 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 2 | POS_CASH_MONTHS_BALANCE_MAX | int64 | 96 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 3 | POS_CASH_MONTHS_BALANCE_MEAN | float64 | 68637 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 4 | POS_CASH_CNT_INSTALMENT_MEAN | float64 | 45080 | > 5 unikatowych wartości liczbowych | 28 | 0.01 |
| 5 | POS_CASH_CNT_INSTALMENT_SUM | float64 | 4323 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 6 | POS_CASH_CNT_INSTALMENT_MAX | float64 | 65 | > 5 unikatowych wartości liczbowych | 28 | 0.01 |
| 7 | POS_CASH_CNT_INSTALMENT_MIN | float64 | 58 | > 5 unikatowych wartości liczbowych | 28 | 0.01 |
| 8 | POS_CASH_CNT_INSTALMENT_FUTURE_MEAN | float64 | 43319 | > 5 unikatowych wartości liczbowych | 28 | 0.01 |
| 9 | POS_CASH_CNT_INSTALMENT_FUTURE_SUM | float64 | 2966 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 10 | POS_CASH_CNT_INSTALMENT_FUTURE_MAX | float64 | 65 | > 5 unikatowych wartości liczbowych | 28 | 0.01 |
| 11 | POS_CASH_CNT_INSTALMENT_FUTURE_MIN | float64 | 61 | > 5 unikatowych wartości liczbowych | 28 | 0.01 |
| 12 | POS_CASH_SK_DPD_MAX | int64 | 2025 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 13 | POS_CASH_SK_DPD_MEAN | float64 | 11737 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 14 | POS_CASH_SK_DPD_DEF_MAX | int64 | 217 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 15 | POS_CASH_SK_DPD_DEF_MEAN | float64 | 4722 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 16 | POS_CASH_NAME_CONTRACT_STATUS_ACTIVE | int64 | 217 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 17 | POS_CASH_NAME_CONTRACT_STATUS_AMORTIZED DEBT | int64 | 14 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 18 | POS_CASH_NAME_CONTRACT_STATUS_APPROVED | int64 | 14 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 19 | POS_CASH_NAME_CONTRACT_STATUS_CANCELED | int64 | 2 | [0, 1] | 0 | 0.00 |
| 20 | POS_CASH_NAME_CONTRACT_STATUS_COMPLETED | int64 | 52 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 21 | POS_CASH_NAME_CONTRACT_STATUS_DEMAND | int64 | 59 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 22 | POS_CASH_NAME_CONTRACT_STATUS_RETURNED TO THE STORE | int64 | 5 | [0, 1, 2, 3, 4] | 0 | 0.00 |
| 23 | POS_CASH_NAME_CONTRACT_STATUS_SIGNED | int64 | 32 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 24 | POS_CASH_NAME_CONTRACT_STATUS_XNA | int64 | 2 | [0, 1] | 0 | 0.00 |
| 25 | POS_CASH_APP_COUNT | int64 | 25 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
Handling missing data after aggregation¶
# Replace missing values with the column medians
cols_to_fill = [
    'POS_CASH_CNT_INSTALMENT_MEAN',
    'POS_CASH_CNT_INSTALMENT_MAX',
    'POS_CASH_CNT_INSTALMENT_MIN',
    'POS_CASH_CNT_INSTALMENT_FUTURE_MEAN',
    'POS_CASH_CNT_INSTALMENT_FUTURE_MAX',
    'POS_CASH_CNT_INSTALMENT_FUTURE_MIN',
]
# Assign the result back instead of calling fillna(inplace=True) on a column selection
for col in cols_to_fill:
    df_POS_CASH_agg[col] = df_POS_CASH_agg[col].fillna(df_POS_CASH_agg[col].median())
# Check that no missing values remain: False -> no missing values anywhere
df_POS_CASH_agg.isnull().any().any()
False
Table "installments_payments"¶
Checking basic characteristics of the dataset¶
installments_payments.shape
(13605401, 8)
results = []
total_rows = len(installments_payments)  # Number of rows in the table
# Iterate over every column
for column in installments_payments:
    unique_values = installments_payments[column].nunique()  # Number of unique values in the column
    data_type = installments_payments[column].dtype  # Data type
    null_count = installments_payments[column].isnull().sum()  # Number of nulls
    null_percent = (null_count / total_rows) * 100  # % of nulls
    # Rules for displaying the unique values
    # With at most 5 unique values, show them regardless of data type
    if unique_values <= 5:
        values_to_display = installments_payments[column].unique()
    # For non-numeric columns, show all unique values even when there are more than 5
    elif installments_payments[column].dtype == 'object':
        values_to_display = installments_payments[column].unique()
    # For numeric columns with more than 5 unique values, show a placeholder
    # message (Polish for "> 5 unique numeric values")
    else:
        values_to_display = "> 5 unikatowych wartości liczbowych"
    # Append the results to the list
    results.append({
        'Variable': column,
        'Data Type': data_type,
        'Unique Values Count': unique_values,
        'Unique Values': values_to_display,
        'Null Values Count': null_count,
        'Null Values %': null_percent
    })
# Convert the results to a DataFrame
results_df = pd.DataFrame(results)
results_df
| | Variable | Data Type | Unique Values Count | Unique Values | Null Values Count | Null Values % |
|---|---|---|---|---|---|---|
| 0 | SK_ID_PREV | int64 | 997752 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 1 | SK_ID_CURR | int64 | 339587 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 2 | NUM_INSTALMENT_VERSION | float64 | 65 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 3 | NUM_INSTALMENT_NUMBER | int64 | 277 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 4 | DAYS_INSTALMENT | float64 | 2922 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 5 | DAYS_ENTRY_PAYMENT | float64 | 3039 | > 5 unikatowych wartości liczbowych | 2905 | 0.02 |
| 6 | AMT_INSTALMENT | float64 | 902539 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 7 | AMT_PAYMENT | float64 | 944235 | > 5 unikatowych wartości liczbowych | 2905 | 0.02 |
# Summary statistics for numeric variables
installments_payments.describe()
| | SK_ID_PREV | SK_ID_CURR | NUM_INSTALMENT_VERSION | NUM_INSTALMENT_NUMBER | DAYS_INSTALMENT | DAYS_ENTRY_PAYMENT | AMT_INSTALMENT | AMT_PAYMENT |
|---|---|---|---|---|---|---|---|---|
| count | 13605401.00 | 13605401.00 | 13605401.00 | 13605401.00 | 13605401.00 | 13602496.00 | 13605401.00 | 13602496.00 |
| mean | 1903364.97 | 278444.88 | 0.86 | 18.87 | -1042.27 | -1051.11 | 17050.91 | 17238.22 |
| std | 536202.91 | 102718.31 | 1.04 | 26.66 | 800.95 | 800.59 | 50570.25 | 54735.78 |
| min | 1000001.00 | 100001.00 | 0.00 | 1.00 | -2922.00 | -4921.00 | 0.00 | 0.00 |
| 25% | 1434191.00 | 189639.00 | 0.00 | 4.00 | -1654.00 | -1662.00 | 4226.09 | 3398.26 |
| 50% | 1896520.00 | 278685.00 | 1.00 | 8.00 | -818.00 | -827.00 | 8884.08 | 8125.52 |
| 75% | 2369094.00 | 367530.00 | 1.00 | 19.00 | -361.00 | -370.00 | 16710.21 | 16108.42 |
| max | 2843499.00 | 456255.00 | 178.00 | 277.00 | -1.00 | -1.00 | 3771487.85 | 3771487.85 |
Handling missing data¶
* Missing values are few and occur in two columns: DAYS_ENTRY_PAYMENT and AMT_PAYMENT (2905 rows each, about 0.02%).
* According to the variable descriptions in the documentation, these variables are related to DAYS_INSTALMENT and AMT_INSTALMENT, respectively.
* Variables with the INSTALMENT suffix hold the planned values, while the PAYMENT variables hold the actual ones.
* Therefore, once the relationship has been checked, the missing values can be filled from the related variables.
Checking the relationship of DAYS_INSTALMENT with DAYS_ENTRY_PAYMENT and of AMT_INSTALMENT with AMT_PAYMENT¶
# Create the plots
fig, axs = plt.subplots(1, 2, figsize=(12, 5))
# Plot for AMT_INSTALMENT and AMT_PAYMENT
sns.scatterplot(x='AMT_INSTALMENT', y='AMT_PAYMENT', data=installments_payments, alpha=0.5, ax=axs[0])
axs[0].set_title('Relationship between AMT_INSTALMENT and AMT_PAYMENT')
axs[0].set_xlabel('AMT_INSTALMENT')
axs[0].set_ylabel('AMT_PAYMENT')
# Plot for DAYS_INSTALMENT and DAYS_ENTRY_PAYMENT
sns.scatterplot(x='DAYS_INSTALMENT', y='DAYS_ENTRY_PAYMENT', data=installments_payments, alpha=0.5, ax=axs[1])
axs[1].set_title('Relationship between DAYS_INSTALMENT and DAYS_ENTRY_PAYMENT')
axs[1].set_xlabel('DAYS_INSTALMENT')
axs[1].set_ylabel('DAYS_ENTRY_PAYMENT')
plt.tight_layout()
plt.show()
The plots show many observations lying on a straight line, where the planned values equal the actual ones. Because the missing values are very few, they will be filled with the values of the related variables.
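A scatter plot gives a visual impression only; the share of rows where planned and actual values coincide can also be measured directly. The sketch below does this on a made-up toy frame and then applies the same fill rule. The numbers are invented and the variable names `share_equal` and `filled` are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy instalment data: the last payment amount is missing
toy = pd.DataFrame({
    "AMT_INSTALMENT": [100.0, 200.0, 300.0, 400.0],
    "AMT_PAYMENT": [100.0, 200.0, 250.0, np.nan],
})

# Share of rows with a known payment where paid amount equals planned amount
known = toy.dropna(subset=["AMT_PAYMENT"])
share_equal = np.isclose(known["AMT_INSTALMENT"], known["AMT_PAYMENT"]).mean()

# Fill the missing payment with the planned instalment
filled = toy["AMT_PAYMENT"].fillna(toy["AMT_INSTALMENT"])

print(f"share of exact matches: {share_equal:.2f}")
print(filled.tolist())
```

On the real table, a high share of exact matches would support filling the gaps with the planned values; a low share would suggest looking for another strategy.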
Filling missing values from the related variables¶
# Fill missing DAYS_ENTRY_PAYMENT values with DAYS_INSTALMENT
installments_payments['DAYS_ENTRY_PAYMENT'] = installments_payments['DAYS_ENTRY_PAYMENT'].fillna(installments_payments['DAYS_INSTALMENT'])
# Fill missing AMT_PAYMENT values with AMT_INSTALMENT
installments_payments['AMT_PAYMENT'] = installments_payments['AMT_PAYMENT'].fillna(installments_payments['AMT_INSTALMENT'])
Feature Engineering¶
Adding the new variable "IS_DELAYED"¶
The plots above, especially the one comparing the day an instalment was due with the day it was actually paid, also show that payments were delayed fairly often. I therefore decided to create a new variable indicating whether a payment was late. Both columns count days relative to the current application and are negative, so DAYS_ENTRY_PAYMENT > DAYS_INSTALMENT (a value closer to zero) means the payment was made after the due date.
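Comparisons on negative day offsets are easy to get backwards, so a three-row toy example may help. The values below are invented, and `DAYS_LATE` is a hypothetical helper column added only to make the direction of the comparison visible.

```python
import pandas as pd

# All instalments were due 30 days before the current application
toy = pd.DataFrame({
    "DAYS_INSTALMENT": [-30.0, -30.0, -30.0],
    "DAYS_ENTRY_PAYMENT": [-35.0, -30.0, -20.0],  # early, on time, late
})

# 1 when the payment date falls after the due date, 0 otherwise
toy["IS_DELAYED"] = (toy["DAYS_ENTRY_PAYMENT"] > toy["DAYS_INSTALMENT"]).astype(int)

# Hypothetical helper: number of days past the due date, 0 for early or on-time payments
toy["DAYS_LATE"] = (toy["DAYS_ENTRY_PAYMENT"] - toy["DAYS_INSTALMENT"]).clip(lower=0)

print(toy)
```

Only the third row is flagged: -20 is greater than -30, meaning the payment arrived 10 days after it was due.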
# Add a new variable flagging a delayed instalment
installments_payments['IS_DELAYED'] = (installments_payments['DAYS_ENTRY_PAYMENT'] > installments_payments['DAYS_INSTALMENT']).astype(int)
# Prefix every variable with the table name
installments_payments.columns = ['INSTALLMENTS_PAYMENTS_' + col if col not in ('SK_ID_PREV', 'SK_ID_CURR') else col for col in installments_payments.columns]  # keep the original key column names
# Add further variables
# Variability of payment amounts (standard deviation within each previous loan)
installments_payments['INSTALLMENTS_PAYMENTS_PAYMENT_CHANGE'] = installments_payments.groupby('SK_ID_PREV')['INSTALLMENTS_PAYMENTS_AMT_PAYMENT'].transform('std')
# Difference between the planned and the actual payment amount (an underpayment when the planned amount exceeds the paid one)
installments_payments['INSTALLMENTS_PAYMENTS_PAYMENT_UNDERPAYMENT'] = installments_payments['INSTALLMENTS_PAYMENTS_AMT_INSTALMENT'] - installments_payments['INSTALLMENTS_PAYMENTS_AMT_PAYMENT']
# Number of distinct instalment schedule versions per previous loan
# (transform keeps the row-level index, so the values align with the rows)
installments_payments['INSTALLMENTS_PAYMENTS_NUM_INSTALMENT_VERSION_CHANGE'] = installments_payments.groupby('SK_ID_PREV')['INSTALLMENTS_PAYMENTS_NUM_INSTALMENT_VERSION'].transform('nunique')
# Aggregations
agg_functions = {
    'INSTALLMENTS_PAYMENTS_IS_DELAYED': ['sum', 'mean'],  # Total number of delays and their share (a mean of 1 means every payment was late; the closer to 1, the worse)
    'INSTALLMENTS_PAYMENTS_NUM_INSTALMENT_VERSION': ['nunique'],  # Number of unique instalment versions
    # Metrics for the variability of payment amounts and for underpayments
    'INSTALLMENTS_PAYMENTS_PAYMENT_CHANGE': ['mean', 'max'],
    'INSTALLMENTS_PAYMENTS_PAYMENT_UNDERPAYMENT': ['mean', 'sum', 'max'],
}
# Aggregate the data to the SK_ID_CURR level
df_installments_payments_agg = installments_payments.groupby('SK_ID_CURR').agg(agg_functions)
df_installments_payments_agg.columns = ['_'.join(col).upper() for col in df_installments_payments_agg.columns.values]
df_installments_payments_agg.reset_index(inplace=True)
# Final variable: percentage of delayed payments among all payments of a client
# (IS_DELAYED is a 0/1 flag, so its per-client mean times 100 is that percentage;
# dividing by a row-level transform('count') would align on the row index, not on SK_ID_CURR)
df_installments_payments_agg['INSTALLMENTS_PAYMENTS_PERCENTAGE_DELAYED'] = df_installments_payments_agg['INSTALLMENTS_PAYMENTS_IS_DELAYED_MEAN'] * 100
# Distribution of the number of delayed instalments per client
#df_installments_payments_agg['INSTALLMENTS_PAYMENTS_IS_DELAYED_SUM'].value_counts()
Checking the aggregated installments_payments table after the changes¶
df_installments_payments_agg.head(1)
| | SK_ID_CURR | INSTALLMENTS_PAYMENTS_IS_DELAYED_SUM | INSTALLMENTS_PAYMENTS_IS_DELAYED_MEAN | INSTALLMENTS_PAYMENTS_NUM_INSTALMENT_VERSION_NUNIQUE | INSTALLMENTS_PAYMENTS_PAYMENT_CHANGE_MEAN | INSTALLMENTS_PAYMENTS_PAYMENT_CHANGE_MAX | INSTALLMENTS_PAYMENTS_PAYMENT_UNDERPAYMENT_MEAN | INSTALLMENTS_PAYMENTS_PAYMENT_UNDERPAYMENT_SUM | INSTALLMENTS_PAYMENTS_PAYMENT_UNDERPAYMENT_MAX | INSTALLMENTS_PAYMENTS_PERCENTAGE_DELAYED |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100001 | 1 | 0.14 | 2 | 3842.25 | 6723.45 | 0.00 | 0.00 | 0.00 | 14.29 |
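When a per-client percentage is derived from aggregated columns, both operands have to live at the client level. pandas aligns Series on index labels, so mixing a client-level Series with a row-level one produces values that look plausible but are wrong. The toy sketch below, with invented data and hypothetical names, contrasts the two cases.

```python
import pandas as pd

# Toy payment flags: client 1 was late once in 4 payments, client 2 twice in 2
toy = pd.DataFrame({
    "SK_ID_CURR": [1, 1, 1, 1, 2, 2],
    "IS_DELAYED": [1, 0, 0, 0, 1, 1],
})

agg = toy.groupby("SK_ID_CURR")["IS_DELAYED"].agg(["sum", "mean"]).reset_index()

# Aligned: both operands have one row per client
agg["PCT_DELAYED"] = agg["mean"] * 100

# Misaligned: transform('count') has one value per original row (index 0..5),
# so the division pairs agg row 1 (client 2) with toy row 1 (client 1)
row_level_count = toy.groupby("SK_ID_CURR")["IS_DELAYED"].transform("count")
misaligned = agg["sum"] / row_level_count * 100

print(agg)
print(misaligned)
```

In the misaligned result client 2 appears with 50% instead of 100%, and the unmatched row labels become NaN, which is the pattern to watch for when a percentage disagrees with the mean of the same flag.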
results = []
total_rows = len(df_installments_payments_agg)  # Number of rows in the table
# Iterate over every column
for column in df_installments_payments_agg:
    unique_values = df_installments_payments_agg[column].nunique()  # Number of unique values in the column
    data_type = df_installments_payments_agg[column].dtype  # Data type
    null_count = df_installments_payments_agg[column].isnull().sum()  # Number of nulls
    null_percent = (null_count / total_rows) * 100  # % of nulls
    # Rules for displaying the unique values
    # With at most 5 unique values, show them regardless of data type
    if unique_values <= 5:
        values_to_display = df_installments_payments_agg[column].unique()
    # For non-numeric columns, show all unique values even when there are more than 5
    elif df_installments_payments_agg[column].dtype == 'object':
        values_to_display = df_installments_payments_agg[column].unique()
    # For numeric columns with more than 5 unique values, show a placeholder
    # message (Polish for "> 5 unique numeric values")
    else:
        values_to_display = "> 5 unikatowych wartości liczbowych"
    # Append the results to the list
    results.append({
        'Variable': column,
        'Data Type': data_type,
        'Unique Values Count': unique_values,
        'Unique Values': values_to_display,
        'Null Values Count': null_count,
        'Null Values %': null_percent
    })
# Convert the results to a DataFrame
results_df = pd.DataFrame(results)
results_df
| | Variable | Data Type | Unique Values Count | Unique Values | Null Values Count | Null Values % |
|---|---|---|---|---|---|---|
| 0 | SK_ID_CURR | int64 | 339587 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 1 | INSTALLMENTS_PAYMENTS_IS_DELAYED_SUM | int32 | 103 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 2 | INSTALLMENTS_PAYMENTS_IS_DELAYED_MEAN | float64 | 5237 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 3 | INSTALLMENTS_PAYMENTS_NUM_INSTALMENT_VERSION_NUNIQUE | int64 | 50 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 4 | INSTALLMENTS_PAYMENTS_PAYMENT_CHANGE_MEAN | float64 | 318328 | > 5 unikatowych wartości liczbowych | 1018 | 0.30 |
| 5 | INSTALLMENTS_PAYMENTS_PAYMENT_CHANGE_MAX | float64 | 304350 | > 5 unikatowych wartości liczbowych | 1018 | 0.30 |
| 6 | INSTALLMENTS_PAYMENTS_PAYMENT_UNDERPAYMENT_MEAN | float64 | 160508 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 7 | INSTALLMENTS_PAYMENTS_PAYMENT_UNDERPAYMENT_SUM | float64 | 148109 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 8 | INSTALLMENTS_PAYMENTS_PAYMENT_UNDERPAYMENT_MAX | float64 | 119803 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 9 | INSTALLMENTS_PAYMENTS_PERCENTAGE_DELAYED | float64 | 5865 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
# Replace missing values with the medians for INSTALLMENTS_PAYMENTS_PAYMENT_CHANGE_MEAN and INSTALLMENTS_PAYMENTS_PAYMENT_CHANGE_MAX
# (assign the result back instead of calling fillna(inplace=True) on a column selection)
for col in ['INSTALLMENTS_PAYMENTS_PAYMENT_CHANGE_MEAN', 'INSTALLMENTS_PAYMENTS_PAYMENT_CHANGE_MAX']:
    df_installments_payments_agg[col] = df_installments_payments_agg[col].fillna(df_installments_payments_agg[col].median())
# Check that no missing values remain: False -> no missing values anywhere
df_installments_payments_agg.isnull().any().any()
False
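The missing values in the two PAYMENT_CHANGE columns have a plausible mechanical source, though this was not verified against the data here: the sample standard deviation is undefined for a group with a single observation. A previous loan with only one recorded payment therefore gets NaN, and a client whose loans all look like that keeps NaN after aggregation. A toy sketch with invented values:

```python
import numpy as np
import pandas as pd

# Client 1 has one loan with three payments, client 2 one loan with a single payment
toy = pd.DataFrame({
    "SK_ID_CURR": [1, 1, 1, 2],
    "SK_ID_PREV": [10, 10, 10, 20],
    "AMT_PAYMENT": [100.0, 150.0, 200.0, 500.0],
})

# Row-level standard deviation of payments within each previous loan
toy["PAYMENT_CHANGE"] = toy.groupby("SK_ID_PREV")["AMT_PAYMENT"].transform("std")

# Client-level mean; NaN survives when every row of the client is NaN
per_client = toy.groupby("SK_ID_CURR")["PAYMENT_CHANGE"].mean()
print(per_client)
```

If that is indeed the cause, filling with 0 (no observed variability) would be an alternative to the median worth comparing; which one suits the model better is a judgment call.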
Table "credit_card_balance"¶
Checking basic characteristics of the dataset¶
credit_card_balance.shape
(3840312, 23)
results = []
total_rows = len(credit_card_balance)  # Number of rows in the table
# Iterate over every column
for column in credit_card_balance:
    unique_values = credit_card_balance[column].nunique()  # Number of unique values in the column
    data_type = credit_card_balance[column].dtype  # Data type
    null_count = credit_card_balance[column].isnull().sum()  # Number of nulls
    null_percent = (null_count / total_rows) * 100  # Percentage of nulls
    # Decide how to display the unique values
    # With at most 5 unique values, show them regardless of data type
    if unique_values <= 5:
        values_to_display = credit_card_balance[column].unique()
    # For a non-numeric column, show all unique values even when there are more than 5
    elif credit_card_balance[column].dtype == 'object':
        values_to_display = credit_card_balance[column].unique()
    # For numeric columns with more than 5 unique values, show the placeholder below
    else:
        values_to_display = "> 5 unikatowych wartości liczbowych"  # i.e. "> 5 unique numeric values"
    # Append the result to the list
    results.append({
        'Variable': column,
        'Data Type': data_type,
        'Unique Values Count': unique_values,
        'Unique Values': values_to_display,
        'Null Values Count': null_count,
        'Null Values %': null_percent
    })
# Convert the results to a DataFrame
results_df = pd.DataFrame(results)
results_df
| | Variable | Data Type | Unique Values Count | Unique Values | Null Values Count | Null Values % |
|---|---|---|---|---|---|---|
| 0 | SK_ID_PREV | int64 | 104307 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 1 | SK_ID_CURR | int64 | 103558 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 2 | MONTHS_BALANCE | int64 | 96 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 3 | AMT_BALANCE | float64 | 1347904 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 4 | AMT_CREDIT_LIMIT_ACTUAL | int64 | 181 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 5 | AMT_DRAWINGS_ATM_CURRENT | float64 | 2267 | > 5 unikatowych wartości liczbowych | 749816 | 19.52 |
| 6 | AMT_DRAWINGS_CURRENT | float64 | 187005 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 7 | AMT_DRAWINGS_OTHER_CURRENT | float64 | 1832 | > 5 unikatowych wartości liczbowych | 749816 | 19.52 |
| 8 | AMT_DRAWINGS_POS_CURRENT | float64 | 168748 | > 5 unikatowych wartości liczbowych | 749816 | 19.52 |
| 9 | AMT_INST_MIN_REGULARITY | float64 | 312266 | > 5 unikatowych wartości liczbowych | 305236 | 7.95 |
| 10 | AMT_PAYMENT_CURRENT | float64 | 163209 | > 5 unikatowych wartości liczbowych | 767988 | 20.00 |
| 11 | AMT_PAYMENT_TOTAL_CURRENT | float64 | 182957 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 12 | AMT_RECEIVABLE_PRINCIPAL | float64 | 1195839 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 13 | AMT_RECIVABLE | float64 | 1338878 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 14 | AMT_TOTAL_RECEIVABLE | float64 | 1339008 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 15 | CNT_DRAWINGS_ATM_CURRENT | float64 | 44 | > 5 unikatowych wartości liczbowych | 749816 | 19.52 |
| 16 | CNT_DRAWINGS_CURRENT | int64 | 129 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 17 | CNT_DRAWINGS_OTHER_CURRENT | float64 | 11 | > 5 unikatowych wartości liczbowych | 749816 | 19.52 |
| 18 | CNT_DRAWINGS_POS_CURRENT | float64 | 133 | > 5 unikatowych wartości liczbowych | 749816 | 19.52 |
| 19 | CNT_INSTALMENT_MATURE_CUM | float64 | 121 | > 5 unikatowych wartości liczbowych | 305236 | 7.95 |
| 20 | NAME_CONTRACT_STATUS | object | 7 | [Active, Completed, Demand, Signed, Sent proposal, Refused, Approved] | 0 | 0.00 |
| 21 | SK_DPD | int64 | 917 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 22 | SK_DPD_DEF | int64 | 378 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
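The same per-column summary loop is repeated for several tables in this notebook. A minimal sketch of how it could be wrapped in one reusable function follows. The function name `summarize_columns`, the `max_unique` parameter, the English placeholder text and the toy data are assumptions made for illustration and are not part of the original notebook.

```python
import pandas as pd


def summarize_columns(frame: pd.DataFrame, max_unique: int = 5) -> pd.DataFrame:
    """Per-column summary: dtype, unique count, displayed values, null count and null %."""
    total_rows = len(frame)
    rows = []
    for column in frame.columns:
        series = frame[column]
        n_unique = series.nunique()
        # Show the values for low-cardinality or non-numeric (object) columns
        if n_unique <= max_unique or series.dtype == 'object':
            shown = series.unique()
        else:
            shown = f"> {max_unique} unique numeric values"
        null_count = int(series.isnull().sum())
        rows.append({
            'Variable': column,
            'Data Type': series.dtype,
            'Unique Values Count': n_unique,
            'Unique Values': shown,
            'Null Values Count': null_count,
            'Null Values %': null_count / total_rows * 100 if total_rows else 0.0,
        })
    return pd.DataFrame(rows)


# Hypothetical toy data, only to show the call
toy = pd.DataFrame({
    'id': [1, 2, 3, 4, 5, 6],
    'status': ['Active', 'Active', 'Signed', None, 'Active', 'Signed'],
})
summary = summarize_columns(toy)
```

With such a helper, each check in the notebook would reduce to a single call, for example `summarize_columns(credit_card_balance)`.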
# Summary statistics for the numeric variables
credit_card_balance.describe()
| | SK_ID_PREV | SK_ID_CURR | MONTHS_BALANCE | AMT_BALANCE | AMT_CREDIT_LIMIT_ACTUAL | AMT_DRAWINGS_ATM_CURRENT | AMT_DRAWINGS_CURRENT | AMT_DRAWINGS_OTHER_CURRENT | AMT_DRAWINGS_POS_CURRENT | AMT_INST_MIN_REGULARITY | AMT_PAYMENT_CURRENT | AMT_PAYMENT_TOTAL_CURRENT | AMT_RECEIVABLE_PRINCIPAL | AMT_RECIVABLE | AMT_TOTAL_RECEIVABLE | CNT_DRAWINGS_ATM_CURRENT | CNT_DRAWINGS_CURRENT | CNT_DRAWINGS_OTHER_CURRENT | CNT_DRAWINGS_POS_CURRENT | CNT_INSTALMENT_MATURE_CUM | SK_DPD | SK_DPD_DEF |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 3840312.00 | 3840312.00 | 3840312.00 | 3840312.00 | 3840312.00 | 3090496.00 | 3840312.00 | 3090496.00 | 3090496.00 | 3535076.00 | 3072324.00 | 3840312.00 | 3840312.00 | 3840312.00 | 3840312.00 | 3090496.00 | 3840312.00 | 3090496.00 | 3090496.00 | 3535076.00 | 3840312.00 | 3840312.00 |
| mean | 1904503.59 | 278324.21 | -34.52 | 58300.16 | 153807.96 | 5961.32 | 7433.39 | 288.17 | 2968.80 | 3540.20 | 10280.54 | 7588.86 | 55965.88 | 58088.81 | 58098.29 | 0.31 | 0.70 | 0.00 | 0.56 | 20.83 | 9.28 | 0.33 |
| std | 536469.47 | 102704.48 | 26.67 | 106307.03 | 165145.70 | 28225.69 | 33846.08 | 8201.99 | 20796.89 | 5600.15 | 36078.08 | 32005.99 | 102533.62 | 105965.37 | 105971.80 | 1.10 | 3.19 | 0.08 | 3.24 | 20.05 | 97.52 | 21.48 |
| min | 1000018.00 | 100006.00 | -96.00 | -420250.18 | 0.00 | -6827.31 | -6211.62 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | -423305.82 | -420250.18 | -420250.18 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| 25% | 1434385.00 | 189517.00 | -55.00 | 0.00 | 45000.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 152.37 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 4.00 | 0.00 | 0.00 |
| 50% | 1897122.00 | 278396.00 | -28.00 | 0.00 | 112500.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 2702.70 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 15.00 | 0.00 | 0.00 |
| 75% | 2369327.75 | 367580.00 | -11.00 | 89046.69 | 180000.00 | 0.00 | 0.00 | 0.00 | 0.00 | 6633.91 | 9000.00 | 6750.00 | 85359.24 | 88899.49 | 88914.51 | 0.00 | 0.00 | 0.00 | 0.00 | 32.00 | 0.00 | 0.00 |
| max | 2843496.00 | 456250.00 | -1.00 | 1505902.19 | 1350000.00 | 2115000.00 | 2287098.31 | 1529847.00 | 2239274.16 | 202882.01 | 4289207.45 | 4278315.69 | 1472316.79 | 1493338.19 | 1493338.19 | 51.00 | 165.00 | 12.00 | 165.00 | 120.00 | 3260.00 | 3260.00 |
* I drop AMT_DRAWINGS_ATM_CURRENT, AMT_DRAWINGS_OTHER_CURRENT and AMT_DRAWINGS_POS_CURRENT from the dataset because almost 20% of their values are missing (19.52%) and they carry similar information to AMT_DRAWINGS_CURRENT
* Likewise, CNT_DRAWINGS_ATM_CURRENT, CNT_DRAWINGS_OTHER_CURRENT and CNT_DRAWINGS_POS_CURRENT carry similar information to CNT_DRAWINGS_CURRENT, so they are dropped as well
* AMT_INST_MIN_REGULARITY (7.95% missing) is also removed
* So is AMT_PAYMENT_CURRENT (20.00% missing), because AMT_PAYMENT_TOTAL_CURRENT carries similar information
columns_to_drop = ['AMT_DRAWINGS_ATM_CURRENT', 'AMT_DRAWINGS_OTHER_CURRENT', 'AMT_DRAWINGS_POS_CURRENT', 'AMT_INST_MIN_REGULARITY', 'AMT_PAYMENT_CURRENT', 'CNT_DRAWINGS_ATM_CURRENT', 'CNT_DRAWINGS_OTHER_CURRENT', 'CNT_DRAWINGS_POS_CURRENT']
# Drop the selected columns from the DataFrame
credit_card_balance = credit_card_balance.drop(columns=columns_to_drop)
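The claim that the dropped drawing columns duplicate AMT_DRAWINGS_CURRENT could be checked before dropping them. The sketch below does this on hypothetical toy rows, constructed so that the three components add up to the total. Whether the same relation holds in the real credit_card_balance table is an assumption that would need to be verified on the data itself.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows, not taken from credit_card_balance
toy = pd.DataFrame({
    'AMT_DRAWINGS_ATM_CURRENT': [100.0, np.nan, 0.0, 50.0, 200.0],
    'AMT_DRAWINGS_POS_CURRENT': [20.0, np.nan, 30.0, 0.0, 10.0],
    'AMT_DRAWINGS_OTHER_CURRENT': [0.0, np.nan, 0.0, 5.0, 0.0],
    'AMT_DRAWINGS_CURRENT': [120.0, 0.0, 30.0, 55.0, 210.0],
})
components = [
    'AMT_DRAWINGS_ATM_CURRENT',
    'AMT_DRAWINGS_POS_CURRENT',
    'AMT_DRAWINGS_OTHER_CURRENT',
]

# Share of missing values per column, in percent
null_share = toy.isnull().mean() * 100

# Sum of the components; min_count=1 keeps NaN where every component is missing
component_sum = toy[components].sum(axis=1, min_count=1)
complete = component_sum.notna()

# Largest absolute gap between the component sum and the retained total
max_gap = (component_sum[complete] - toy.loc[complete, 'AMT_DRAWINGS_CURRENT']).abs().max()
```

A `max_gap` close to zero on the real table would support dropping the components. A large gap would suggest that they carry information the total does not.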
Feature Engineering¶
# Before merging, prefix every column with the table name
credit_card_balance.columns = ['CREDIT_CARD_BALANCE_' + col if col not in ('SK_ID_PREV', 'SK_ID_CURR') else col for col in credit_card_balance.columns]  # keep the original names of the key columns
# Aggregations for the existing variables
agg_functions = {
    'CREDIT_CARD_BALANCE_MONTHS_BALANCE': ['min', 'max', 'mean'],  # Earliest, latest and mean balance month
    'CREDIT_CARD_BALANCE_AMT_BALANCE': ['sum', 'mean', 'max', 'min', 'std'],  # Monthly balance statistics
    'CREDIT_CARD_BALANCE_AMT_CREDIT_LIMIT_ACTUAL': ['sum', 'mean', 'max', 'min', 'std'],  # Credit card limit
    'CREDIT_CARD_BALANCE_AMT_DRAWINGS_CURRENT': ['sum', 'mean', 'max', 'min', 'std'],  # Amounts drawn
    'CREDIT_CARD_BALANCE_AMT_PAYMENT_TOTAL_CURRENT': ['sum', 'mean', 'max', 'min'],  # Client payments
    'CREDIT_CARD_BALANCE_AMT_TOTAL_RECEIVABLE': ['sum', 'mean', 'max', 'min', 'std'],  # Total amount receivable on the credit card
    'CREDIT_CARD_BALANCE_CNT_DRAWINGS_CURRENT': ['sum', 'mean', 'max', 'min'],  # Number of drawings in the month
    'CREDIT_CARD_BALANCE_CNT_INSTALMENT_MATURE_CUM': ['sum', 'mean', 'max', 'min'],  # Number of paid installments
    'CREDIT_CARD_BALANCE_SK_DPD': ['max', 'mean', 'sum'],  # Days past due
    'CREDIT_CARD_BALANCE_SK_DPD_DEF': ['max', 'mean', 'sum']  # Days past due with tolerance
}
# Aggregate the data per client
df_credit_card_balance_agg = credit_card_balance.groupby('SK_ID_CURR').agg(agg_functions)
# New variable: credit utilisation (balance divided by the card limit)
credit_card_balance['CREDIT_CARD_BALANCE_CREDIT_USE_RATE'] = credit_card_balance['CREDIT_CARD_BALANCE_AMT_BALANCE'] / credit_card_balance['CREDIT_CARD_BALANCE_AMT_CREDIT_LIMIT_ACTUAL']
# Replace the infinities and NaN produced by division by zero
credit_card_balance.replace([np.inf, -np.inf], np.nan, inplace=True)
credit_card_balance.fillna(0, inplace=True)
# Aggregate the new variable
new_agg_functions = {
    'CREDIT_CARD_BALANCE_CREDIT_USE_RATE': ['mean', 'max'],
}
df_new_credit_card_balance_agg = credit_card_balance.groupby('SK_ID_CURR').agg(new_agg_functions)
# Join both aggregations
df_credit_card_balance_agg = df_credit_card_balance_agg.join(df_new_credit_card_balance_agg, on='SK_ID_CURR', how='left', rsuffix='_NEW')
# Flatten the column names and reset the index
df_credit_card_balance_agg.columns = ['_'.join(col).upper() for col in df_credit_card_balance_agg.columns.values]
df_credit_card_balance_agg.reset_index(inplace=True)
# Count the occurrences of each NAME_CONTRACT_STATUS category per client
contract_status_agg = credit_card_balance.groupby('SK_ID_CURR')['CREDIT_CARD_BALANCE_NAME_CONTRACT_STATUS'].value_counts().unstack(fill_value=0)
contract_status_agg.columns = ['CREDIT_CARD_BALANCE_STATUS_' + col for col in contract_status_agg.columns]
# Join the contract status counts
df_credit_card_balance_agg = df_credit_card_balance_agg.join(contract_status_agg, on='SK_ID_CURR', how='left')
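To make the steps above easier to follow, here is a small sketch of the same pattern on hypothetical toy data: a ratio with a zero denominator, per-client aggregation, flattening of the two-level column names, and status counts via `value_counts().unstack()`. The shortened column names and all values are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: client 2 has a zero credit limit in both months
toy = pd.DataFrame({
    'SK_ID_CURR': [1, 1, 2, 2],
    'AMT_BALANCE': [50.0, 100.0, 30.0, 0.0],
    'AMT_CREDIT_LIMIT_ACTUAL': [100, 200, 0, 0],
    'NAME_CONTRACT_STATUS': ['Active', 'Active', 'Active', 'Completed'],
})

# 30 / 0 gives inf and 0 / 0 gives NaN; both are mapped to 0, as in the notebook
toy['CREDIT_USE_RATE'] = toy['AMT_BALANCE'] / toy['AMT_CREDIT_LIMIT_ACTUAL']
toy['CREDIT_USE_RATE'] = toy['CREDIT_USE_RATE'].replace([np.inf, -np.inf], np.nan).fillna(0)

# Aggregating with a dict of lists produces two-level column names
agg = toy.groupby('SK_ID_CURR').agg({'CREDIT_USE_RATE': ['mean', 'max']})
# Flatten ('CREDIT_USE_RATE', 'mean') into 'CREDIT_USE_RATE_MEAN'
agg.columns = ['_'.join(col).upper() for col in agg.columns]
agg = agg.reset_index()

# One column per status, holding the number of months spent in that status
status_counts = toy.groupby('SK_ID_CURR')['NAME_CONTRACT_STATUS'].value_counts().unstack(fill_value=0)
status_counts.columns = ['STATUS_' + col for col in status_counts.columns]
agg = agg.join(status_counts, on='SK_ID_CURR', how='left')
```

Mapping a zero limit to a utilisation of 0 follows the notebook's choice. It treats a card without a limit the same as an unused card, which is worth keeping in mind when interpreting the feature.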
Checking the aggregated credit_card_balance table¶
results = []
total_rows = len(df_credit_card_balance_agg)  # Number of rows in the table
# Iterate over every column
for column in df_credit_card_balance_agg:
    unique_values = df_credit_card_balance_agg[column].nunique()  # Number of unique values in the column
    data_type = df_credit_card_balance_agg[column].dtype  # Data type
    null_count = df_credit_card_balance_agg[column].isnull().sum()  # Number of nulls
    null_percent = (null_count / total_rows) * 100  # Percentage of nulls
    # Decide how to display the unique values
    # With at most 5 unique values, show them regardless of data type
    if unique_values <= 5:
        values_to_display = df_credit_card_balance_agg[column].unique()
    # For a non-numeric column, show all unique values even when there are more than 5
    elif df_credit_card_balance_agg[column].dtype == 'object':
        values_to_display = df_credit_card_balance_agg[column].unique()
    # For numeric columns with more than 5 unique values, show the placeholder below
    else:
        values_to_display = "> 5 unikatowych wartości liczbowych"  # i.e. "> 5 unique numeric values"
    # Append the result to the list
    results.append({
        'Variable': column,
        'Data Type': data_type,
        'Unique Values Count': unique_values,
        'Unique Values': values_to_display,
        'Null Values Count': null_count,
        'Null Values %': null_percent
    })
# Convert the results to a DataFrame
results_df = pd.DataFrame(results)
results_df
| | Variable | Data Type | Unique Values Count | Unique Values | Null Values Count | Null Values % |
|---|---|---|---|---|---|---|
| 0 | SK_ID_CURR | int64 | 103558 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 1 | CREDIT_CARD_BALANCE_MONTHS_BALANCE_MIN | int64 | 96 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 2 | CREDIT_CARD_BALANCE_MONTHS_BALANCE_MAX | int64 | 7 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 3 | CREDIT_CARD_BALANCE_MONTHS_BALANCE_MEAN | float64 | 570 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 4 | CREDIT_CARD_BALANCE_AMT_BALANCE_SUM | float64 | 69080 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 5 | CREDIT_CARD_BALANCE_AMT_BALANCE_MEAN | float64 | 70080 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 6 | CREDIT_CARD_BALANCE_AMT_BALANCE_MAX | float64 | 66374 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 7 | CREDIT_CARD_BALANCE_AMT_BALANCE_MIN | float64 | 13320 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 8 | CREDIT_CARD_BALANCE_AMT_BALANCE_STD | float64 | 69961 | > 5 unikatowych wartości liczbowych | 692 | 0.67 |
| 9 | CREDIT_CARD_BALANCE_AMT_CREDIT_LIMIT_ACTUAL_SUM | int64 | 2843 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 10 | CREDIT_CARD_BALANCE_AMT_CREDIT_LIMIT_ACTUAL_MEAN | float64 | 13036 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 11 | CREDIT_CARD_BALANCE_AMT_CREDIT_LIMIT_ACTUAL_MAX | int64 | 54 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 12 | CREDIT_CARD_BALANCE_AMT_CREDIT_LIMIT_ACTUAL_MIN | int64 | 180 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 13 | CREDIT_CARD_BALANCE_AMT_CREDIT_LIMIT_ACTUAL_STD | float64 | 26234 | > 5 unikatowych wartości liczbowych | 692 | 0.67 |
| 14 | CREDIT_CARD_BALANCE_AMT_DRAWINGS_CURRENT_SUM | float64 | 47379 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 15 | CREDIT_CARD_BALANCE_AMT_DRAWINGS_CURRENT_MEAN | float64 | 57397 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 16 | CREDIT_CARD_BALANCE_AMT_DRAWINGS_CURRENT_MAX | float64 | 28333 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 17 | CREDIT_CARD_BALANCE_AMT_DRAWINGS_CURRENT_MIN | float64 | 2363 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 18 | CREDIT_CARD_BALANCE_AMT_DRAWINGS_CURRENT_STD | float64 | 65572 | > 5 unikatowych wartości liczbowych | 692 | 0.67 |
| 19 | CREDIT_CARD_BALANCE_AMT_PAYMENT_TOTAL_CURRENT_SUM | float64 | 61827 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 20 | CREDIT_CARD_BALANCE_AMT_PAYMENT_TOTAL_CURRENT_MEAN | float64 | 67932 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 21 | CREDIT_CARD_BALANCE_AMT_PAYMENT_TOTAL_CURRENT_MAX | float64 | 35265 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 22 | CREDIT_CARD_BALANCE_AMT_PAYMENT_TOTAL_CURRENT_MIN | float64 | 1754 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 23 | CREDIT_CARD_BALANCE_AMT_TOTAL_RECEIVABLE_SUM | float64 | 69199 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 24 | CREDIT_CARD_BALANCE_AMT_TOTAL_RECEIVABLE_MEAN | float64 | 70224 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 25 | CREDIT_CARD_BALANCE_AMT_TOTAL_RECEIVABLE_MAX | float64 | 66012 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 26 | CREDIT_CARD_BALANCE_AMT_TOTAL_RECEIVABLE_MIN | float64 | 22496 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 27 | CREDIT_CARD_BALANCE_AMT_TOTAL_RECEIVABLE_STD | float64 | 70145 | > 5 unikatowych wartości liczbowych | 692 | 0.67 |
| 28 | CREDIT_CARD_BALANCE_CNT_DRAWINGS_CURRENT_SUM | int64 | 585 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 29 | CREDIT_CARD_BALANCE_CNT_DRAWINGS_CURRENT_MEAN | float64 | 7111 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 30 | CREDIT_CARD_BALANCE_CNT_DRAWINGS_CURRENT_MAX | int64 | 123 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 31 | CREDIT_CARD_BALANCE_CNT_DRAWINGS_CURRENT_MIN | int64 | 45 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 32 | CREDIT_CARD_BALANCE_CNT_INSTALMENT_MATURE_CUM_SUM | float64 | 5112 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 33 | CREDIT_CARD_BALANCE_CNT_INSTALMENT_MATURE_CUM_MEAN | float64 | 15471 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 34 | CREDIT_CARD_BALANCE_CNT_INSTALMENT_MATURE_CUM_MAX | float64 | 121 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 35 | CREDIT_CARD_BALANCE_CNT_INSTALMENT_MATURE_CUM_MIN | float64 | 30 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 36 | CREDIT_CARD_BALANCE_SK_DPD_MAX | int64 | 438 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 37 | CREDIT_CARD_BALANCE_SK_DPD_MEAN | float64 | 3945 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 38 | CREDIT_CARD_BALANCE_SK_DPD_SUM | int64 | 1627 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 39 | CREDIT_CARD_BALANCE_SK_DPD_DEF_MAX | int64 | 62 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 40 | CREDIT_CARD_BALANCE_SK_DPD_DEF_MEAN | float64 | 1629 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 41 | CREDIT_CARD_BALANCE_SK_DPD_DEF_SUM | int64 | 229 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 42 | CREDIT_CARD_BALANCE_CREDIT_USE_RATE_MEAN | float64 | 69990 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 43 | CREDIT_CARD_BALANCE_CREDIT_USE_RATE_MAX | float64 | 65671 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 44 | CREDIT_CARD_BALANCE_STATUS_Active | int64 | 104 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 45 | CREDIT_CARD_BALANCE_STATUS_Approved | int64 | 2 | [0, 1] | 0 | 0.00 |
| 46 | CREDIT_CARD_BALANCE_STATUS_Completed | int64 | 44 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 47 | CREDIT_CARD_BALANCE_STATUS_Demand | int64 | 21 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 48 | CREDIT_CARD_BALANCE_STATUS_Refused | int64 | 2 | [0, 1] | 0 | 0.00 |
| 49 | CREDIT_CARD_BALANCE_STATUS_Sent proposal | int64 | 2 | [0, 1] | 0 | 0.00 |
| 50 | CREDIT_CARD_BALANCE_STATUS_Signed | int64 | 45 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
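Handling missing data¶
There are very few missing values (692 rows, or 0.67%, in each of the four STD columns), so I replace them with medians. These are most likely clients with a single monthly record, for whom the sample standard deviation is undefined.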
# Replace missing values with the median in the four STD columns
median_1 = df_credit_card_balance_agg['CREDIT_CARD_BALANCE_AMT_BALANCE_STD'].median()
median_2 = df_credit_card_balance_agg['CREDIT_CARD_BALANCE_AMT_CREDIT_LIMIT_ACTUAL_STD'].median()
median_3 = df_credit_card_balance_agg['CREDIT_CARD_BALANCE_AMT_DRAWINGS_CURRENT_STD'].median()
median_4 = df_credit_card_balance_agg['CREDIT_CARD_BALANCE_AMT_TOTAL_RECEIVABLE_STD'].median()
# Assign the result back: fillna(inplace=True) on a selected column may not update the DataFrame in recent pandas versions
df_credit_card_balance_agg['CREDIT_CARD_BALANCE_AMT_BALANCE_STD'] = df_credit_card_balance_agg['CREDIT_CARD_BALANCE_AMT_BALANCE_STD'].fillna(median_1)
df_credit_card_balance_agg['CREDIT_CARD_BALANCE_AMT_CREDIT_LIMIT_ACTUAL_STD'] = df_credit_card_balance_agg['CREDIT_CARD_BALANCE_AMT_CREDIT_LIMIT_ACTUAL_STD'].fillna(median_2)
df_credit_card_balance_agg['CREDIT_CARD_BALANCE_AMT_DRAWINGS_CURRENT_STD'] = df_credit_card_balance_agg['CREDIT_CARD_BALANCE_AMT_DRAWINGS_CURRENT_STD'].fillna(median_3)
df_credit_card_balance_agg['CREDIT_CARD_BALANCE_AMT_TOTAL_RECEIVABLE_STD'] = df_credit_card_balance_agg['CREDIT_CARD_BALANCE_AMT_TOTAL_RECEIVABLE_STD'].fillna(median_4)
# Check that the missing values are gone: False -> no missing values anywhere
df_credit_card_balance_agg.isnull().any().any()
False
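The column-by-column median imputation above could also be written for all affected columns at once. The following is a minimal sketch on hypothetical toy data. It selects the columns by the `_STD` suffix, which is an assumption that happens to match the four columns with missing values in this table.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps in two *_STD columns
toy = pd.DataFrame({
    'SK_ID_CURR': [1, 2, 3, 4],
    'A_STD': [1.0, np.nan, 3.0, 5.0],
    'B_STD': [np.nan, 10.0, 20.0, 30.0],
})

# Select the columns to impute by suffix
std_columns = [col for col in toy.columns if col.endswith('_STD')]

# median() returns one value per column; fillna aligns them by column name
medians = toy[std_columns].median()
toy[std_columns] = toy[std_columns].fillna(medians)

# Final check, as in the notebook: False means no missing values remain
remaining_nulls = toy.isnull().any().any()
```

Assigning the filled block back to the DataFrame avoids chained in-place modification and keeps the imputation in one place if more columns need it later.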
Merging the data sources¶
Now that all tables are aggregated, they need to be joined to the main "application" table.
Merging "df_bik" with "application"¶
df = application.merge(df_bik_final, on='SK_ID_CURR', how='left')
Overview of the data after the merge¶
df.shape
(307511, 99)
This is correct: the application table had 68 columns and df_bik had 32 (including the key), so 68 + 32 - 1 = 99.
results = []
total_rows = len(df)  # Number of rows in the table
# Iterate over every column
for column in df:
    unique_values = df[column].nunique()  # Number of unique values in the column
    data_type = df[column].dtype  # Data type
    null_count = df[column].isnull().sum()  # Number of nulls
    null_percent = (null_count / total_rows) * 100  # Percentage of nulls
    # Decide how to display the unique values
    # With at most 5 unique values, show them regardless of data type
    if unique_values <= 5:
        values_to_display = df[column].unique()
    # For a non-numeric column, show all unique values even when there are more than 5
    elif df[column].dtype == 'object':
        values_to_display = df[column].unique()
    # For numeric columns with more than 5 unique values, show the placeholder below
    else:
        values_to_display = "> 5 unikatowych wartości liczbowych"  # i.e. "> 5 unique numeric values"
    # Append the result to the list
    results.append({
        'Variable': column,
        'Data Type': data_type,
        'Unique Values Count': unique_values,
        'Unique Values': values_to_display,
        'Null Values Count': null_count,
        'Null Values %': null_percent
    })
# Convert the results to a DataFrame
results_df = pd.DataFrame(results)
results_df
| | Variable | Data Type | Unique Values Count | Unique Values | Null Values Count | Null Values % |
|---|---|---|---|---|---|---|
| 0 | SK_ID_CURR | int64 | 307511 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 1 | TARGET | int64 | 2 | [1, 0] | 0 | 0.00 |
| 2 | NAME_CONTRACT_TYPE | object | 2 | [Cash loans, Revolving loans] | 0 | 0.00 |
| 3 | CODE_GENDER | object | 3 | [M, F, XNA] | 0 | 0.00 |
| 4 | FLAG_OWN_CAR | object | 2 | [N, Y] | 0 | 0.00 |
| 5 | FLAG_OWN_REALTY | object | 2 | [Y, N] | 0 | 0.00 |
| 6 | CNT_CHILDREN | int64 | 15 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 7 | AMT_INCOME_TOTAL | float64 | 2548 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 8 | AMT_CREDIT | float64 | 5603 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 9 | AMT_ANNUITY | float64 | 13672 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 10 | AMT_GOODS_PRICE | float64 | 1002 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 11 | NAME_TYPE_SUITE | object | 8 | [Unaccompanied, Family, Spouse, partner, Children, Other_A, Brak Danych, Other_B, Group of people] | 0 | 0.00 |
| 12 | NAME_INCOME_TYPE | object | 8 | [Working, State servant, Commercial associate, Pensioner, Unemployed, Student, Businessman, Maternity leave] | 0 | 0.00 |
| 13 | NAME_EDUCATION_TYPE | object | 5 | [Secondary / secondary special, Higher education, Incomplete higher, Lower secondary, Academic degree] | 0 | 0.00 |
| 14 | NAME_FAMILY_STATUS | object | 6 | [Single / not married, Married, Civil marriage, Widow, Separated, Unknown] | 0 | 0.00 |
| 15 | NAME_HOUSING_TYPE | object | 6 | [House / apartment, Rented apartment, With parents, Municipal apartment, Office apartment, Co-op apartment] | 0 | 0.00 |
| 16 | REGION_POPULATION_RELATIVE | float64 | 81 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 17 | DAYS_BIRTH | int64 | 17460 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 18 | DAYS_EMPLOYED | int64 | 12574 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 19 | DAYS_REGISTRATION | float64 | 15688 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 20 | DAYS_ID_PUBLISH | int64 | 6168 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 21 | FLAG_MOBIL | int64 | 2 | [1, 0] | 0 | 0.00 |
| 22 | FLAG_EMP_PHONE | int64 | 2 | [1, 0] | 0 | 0.00 |
| 23 | FLAG_WORK_PHONE | int64 | 2 | [0, 1] | 0 | 0.00 |
| 24 | FLAG_CONT_MOBILE | int64 | 2 | [1, 0] | 0 | 0.00 |
| 25 | FLAG_PHONE | int64 | 2 | [1, 0] | 0 | 0.00 |
| 26 | FLAG_EMAIL | int64 | 2 | [0, 1] | 0 | 0.00 |
| 27 | OCCUPATION_TYPE | object | 19 | [Laborers, Core staff, Accountants, Managers, Brak Danych, Drivers, Sales staff, Cleaning staff, Cooking staff, Private service staff, Medicine staff, Security staff, High skill tech staff, Waiters/barmen staff, Low-skill Laborers, Realty agents, Secretaries, IT staff, HR staff] | 0 | 0.00 |
| 28 | CNT_FAM_MEMBERS | float64 | 17 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 29 | REGION_RATING_CLIENT | int64 | 3 | [2, 1, 3] | 0 | 0.00 |
| 30 | REGION_RATING_CLIENT_W_CITY | int64 | 3 | [2, 1, 3] | 0 | 0.00 |
| 31 | WEEKDAY_APPR_PROCESS_START | object | 7 | [WEDNESDAY, MONDAY, THURSDAY, SUNDAY, SATURDAY, FRIDAY, TUESDAY] | 0 | 0.00 |
| 32 | HOUR_APPR_PROCESS_START | int64 | 24 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 33 | REG_REGION_NOT_LIVE_REGION | int64 | 2 | [0, 1] | 0 | 0.00 |
| 34 | REG_REGION_NOT_WORK_REGION | int64 | 2 | [0, 1] | 0 | 0.00 |
| 35 | LIVE_REGION_NOT_WORK_REGION | int64 | 2 | [0, 1] | 0 | 0.00 |
| 36 | REG_CITY_NOT_LIVE_CITY | int64 | 2 | [0, 1] | 0 | 0.00 |
| 37 | REG_CITY_NOT_WORK_CITY | int64 | 2 | [0, 1] | 0 | 0.00 |
| 38 | LIVE_CITY_NOT_WORK_CITY | int64 | 2 | [0, 1] | 0 | 0.00 |
| 39 | ORGANIZATION_TYPE | object | 58 | [Business Entity Type 3, School, Government, Religion, Other, XNA, Electricity, Medicine, Business Entity Type 2, Self-employed, Transport: type 2, Construction, Housing, Kindergarten, Trade: type 7, Industry: type 11, Military, Services, Security Ministries, Transport: type 4, Industry: type 1, Emergency, Security, Trade: type 2, University, Transport: type 3, Police, Business Entity Type 1, Postal, Industry: type 4, Agriculture, Restaurant, Culture, Hotel, Industry: type 7, Trade: type 3, Industry: type 3, Bank, Industry: type 9, Insurance, Trade: type 6, Industry: type 2, Transport: type 1, Industry: type 12, Mobile, Trade: type 1, Industry: type 5, Industry: type 10, Legal Services, Advertising, Trade: type 5, Cleaning, Industry: type 13, Trade: type 4, Telecom, Industry: type 8, Realtor, Industry: type 6] | 0 | 0.00 |
| 40 | EXT_SOURCE_2 | float64 | 119831 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 41 | EXT_SOURCE_3 | float64 | 814 | > 5 unikatowych wartości liczbowych | 60965 | 19.83 |
| 42 | OBS_30_CNT_SOCIAL_CIRCLE | float64 | 33 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 43 | DEF_30_CNT_SOCIAL_CIRCLE | float64 | 10 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 44 | OBS_60_CNT_SOCIAL_CIRCLE | float64 | 33 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 45 | DEF_60_CNT_SOCIAL_CIRCLE | float64 | 9 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 46 | DAYS_LAST_PHONE_CHANGE | float64 | 3773 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 47 | FLAG_DOCUMENT_2 | int64 | 2 | [0, 1] | 0 | 0.00 |
| 48 | FLAG_DOCUMENT_3 | int64 | 2 | [1, 0] | 0 | 0.00 |
| 49 | FLAG_DOCUMENT_4 | int64 | 2 | [0, 1] | 0 | 0.00 |
| 50 | FLAG_DOCUMENT_5 | int64 | 2 | [0, 1] | 0 | 0.00 |
| 51 | FLAG_DOCUMENT_6 | int64 | 2 | [0, 1] | 0 | 0.00 |
| 52 | FLAG_DOCUMENT_7 | int64 | 2 | [0, 1] | 0 | 0.00 |
| 53 | FLAG_DOCUMENT_8 | int64 | 2 | [0, 1] | 0 | 0.00 |
| 54 | FLAG_DOCUMENT_9 | int64 | 2 | [0, 1] | 0 | 0.00 |
| 55 | FLAG_DOCUMENT_10 | int64 | 2 | [0, 1] | 0 | 0.00 |
| 56 | FLAG_DOCUMENT_11 | int64 | 2 | [0, 1] | 0 | 0.00 |
| 57 | FLAG_DOCUMENT_12 | int64 | 2 | [0, 1] | 0 | 0.00 |
| 58 | FLAG_DOCUMENT_13 | int64 | 2 | [0, 1] | 0 | 0.00 |
| 59 | FLAG_DOCUMENT_14 | int64 | 2 | [0, 1] | 0 | 0.00 |
| 60 | FLAG_DOCUMENT_15 | int64 | 2 | [0, 1] | 0 | 0.00 |
| 61 | FLAG_DOCUMENT_16 | int64 | 2 | [0, 1] | 0 | 0.00 |
| 62 | FLAG_DOCUMENT_17 | int64 | 2 | [0, 1] | 0 | 0.00 |
| 63 | FLAG_DOCUMENT_18 | int64 | 2 | [0, 1] | 0 | 0.00 |
| 64 | FLAG_DOCUMENT_19 | int64 | 2 | [0, 1] | 0 | 0.00 |
| 65 | FLAG_DOCUMENT_20 | int64 | 2 | [0, 1] | 0 | 0.00 |
| 66 | FLAG_DOCUMENT_21 | int64 | 2 | [0, 1] | 0 | 0.00 |
| 67 | AMT_REQ_CREDIT_BUREAU_YEAR | float64 | 25 | > 5 unikatowych wartości liczbowych | 41519 | 13.50 |
| 68 | BUREAU_DAYS_CREDIT_MEAN | float64 | 64556 | > 5 unikatowych wartości liczbowych | 44020 | 14.31 |
| 69 | BUREAU_DAYS_CREDIT_MAX | float64 | 2923 | > 5 unikatowych wartości liczbowych | 44020 | 14.31 |
| 70 | BUREAU_DAYS_CREDIT_MIN | float64 | 2922 | > 5 unikatowych wartości liczbowych | 44020 | 14.31 |
| 71 | BUREAU_CREDIT_DAY_OVERDUE_MEAN | float64 | 1541 | > 5 unikatowych wartości liczbowych | 44020 | 14.31 |
| 72 | BUREAU_CREDIT_DAY_OVERDUE_SUM | float64 | 879 | > 5 unikatowych wartości liczbowych | 44020 | 14.31 |
| 73 | BUREAU_CREDIT_DAY_OVERDUE_MAX | float64 | 868 | > 5 unikatowych wartości liczbowych | 44020 | 14.31 |
| 74 | BUREAU_DAYS_CREDIT_ENDDATE_MEAN | float64 | 97904 | > 5 unikatowych wartości liczbowych | 44020 | 14.31 |
| 75 | BUREAU_DAYS_CREDIT_ENDDATE_MAX | float64 | 13030 | > 5 unikatowych wartości liczbowych | 44020 | 14.31 |
| 76 | BUREAU_DAYS_CREDIT_ENDDATE_MIN | float64 | 6828 | > 5 unikatowych wartości liczbowych | 44020 | 14.31 |
| 77 | BUREAU_CNT_CREDIT_PROLONG_MEAN | float64 | 111 | > 5 unikatowych wartości liczbowych | 44020 | 14.31 |
| 78 | BUREAU_CNT_CREDIT_PROLONG_SUM | float64 | 10 | > 5 unikatowych wartości liczbowych | 44020 | 14.31 |
| 79 | BUREAU_AMT_CREDIT_SUM_MEAN | float64 | 209605 | > 5 unikatowych wartości liczbowych | 44020 | 14.31 |
| 80 | BUREAU_AMT_CREDIT_SUM_SUM | float64 | 205644 | > 5 unikatowych wartości liczbowych | 44020 | 14.31 |
| 81 | BUREAU_AMT_CREDIT_SUM_DEBT_MEAN | float64 | 169334 | > 5 unikatowych wartości liczbowych | 44020 | 14.31 |
| 82 | BUREAU_AMT_CREDIT_SUM_DEBT_SUM | float64 | 155004 | > 5 unikatowych wartości liczbowych | 44020 | 14.31 |
| 83 | BUREAU_AMT_CREDIT_SUM_LIMIT_SUM | float64 | 37290 | > 5 unikatowych wartości liczbowych | 44020 | 14.31 |
| 84 | BUREAU_AMT_CREDIT_SUM_OVERDUE_MEAN | float64 | 1873 | > 5 unikatowych wartości liczbowych | 44020 | 14.31 |
| 85 | BUREAU_AMT_CREDIT_SUM_OVERDUE_SUM | float64 | 1218 | > 5 unikatowych wartości liczbowych | 44020 | 14.31 |
| 86 | BUREAU_DAYS_CREDIT_UPDATE_MAX | float64 | 2663 | > 5 unikatowych wartości liczbowych | 44020 | 14.31 |
| 87 | BUREAU_DAYS_CREDIT_UPDATE_MIN | float64 | 2969 | > 5 unikatowych wartości liczbowych | 44020 | 14.31 |
| 88 | BUREAU_MOST_FREQ_CREDIT_ACTIVE | object | 3 | [Closed, nan, Active, Sold] | 44020 | 14.31 |
| 89 | BUREAU_MOST_FREQ_CREDIT_CURRENCY | object | 3 | [currency 1, nan, currency 2, currency 3] | 44020 | 14.31 |
| 90 | BUREAU_MOST_FREQ_CREDIT_TYPE | object | 11 | [Consumer credit, nan, Credit card, Car loan, Microloan, Mortgage, Another type of loan, Loan for business development, Loan for working capital replenishment, Unknown type of loan, Real estate loan, Loan for the purchase of equipment] | 44020 | 14.31 |
| 91 | BUREAU_BALANCE_STATUS_0 | float64 | 451 | > 5 unikatowych wartości liczbowych | 44020 | 14.31 |
| 92 | BUREAU_BALANCE_STATUS_1 | float64 | 101 | > 5 unikatowych wartości liczbowych | 44020 | 14.31 |
| 93 | BUREAU_BALANCE_STATUS_2 | float64 | 32 | > 5 unikatowych wartości liczbowych | 44020 | 14.31 |
| 94 | BUREAU_BALANCE_STATUS_3 | float64 | 24 | > 5 unikatowych wartości liczbowych | 44020 | 14.31 |
| 95 | BUREAU_BALANCE_STATUS_4 | float64 | 19 | > 5 unikatowych wartości liczbowych | 44020 | 14.31 |
| 96 | BUREAU_BALANCE_STATUS_5 | float64 | 132 | > 5 unikatowych wartości liczbowych | 44020 | 14.31 |
| 97 | BUREAU_BALANCE_STATUS_C | float64 | 764 | > 5 unikatowych wartości liczbowych | 44020 | 14.31 |
| 98 | BUREAU_BALANCE_STATUS_X | float64 | 536 | > 5 unikatowych wartości liczbowych | 44020 | 14.31 |
After the left join onto the application table, all columns coming from the credit bureau (BIK) table have the same number of missing values (44,020 rows, 14.31%). These are simply applicants with no matching record in the bureau data. The missing values are kept and will be represented as a separate, informative category.
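One way to confirm that the identical null counts come from unmatched applicants is the `indicator` option of `merge`. The sketch below uses hypothetical toy frames. The flag name `HAS_BUREAU_HISTORY` is an assumption for illustration, not a column used in the notebook.

```python
import pandas as pd

# Hypothetical toy frames: applicant 2 has no bureau record
app = pd.DataFrame({'SK_ID_CURR': [1, 2, 3], 'TARGET': [0, 1, 0]})
bureau_agg = pd.DataFrame({
    'SK_ID_CURR': [1, 3],
    'BUREAU_DAYS_CREDIT_MEAN': [-500.0, -120.0],
})

# indicator=True adds a '_merge' column with 'both', 'left_only' or 'right_only'
merged = app.merge(bureau_agg, on='SK_ID_CURR', how='left', indicator=True)

# Turn the indicator into an explicit 0/1 flag and drop the helper column
merged['HAS_BUREAU_HISTORY'] = (merged['_merge'] == 'both').astype(int)
merged = merged.drop(columns='_merge')

# Number of applicants without any bureau record
n_without_history = int((merged['HAS_BUREAU_HISTORY'] == 0).sum())
```

If `n_without_history` equals the shared null count of the bureau columns, the gaps are fully explained by missing bureau history rather than by missing values inside the bureau data.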
Merging "application" with all tables on previous applications and products¶
The 'previous_application' table contains the history of each client's earlier loan applications. The other three tables, 'POS_CASH_balance', 'installments_payments' and 'credit_card_balance', are linked both to 'previous_application' via SK_ID_PREV and to 'application' via SK_ID_CURR.
I therefore join all four tables directly at the level of the 'application' table. The number of rows in application stays the same, and the table gains aggregated information about each client's previous products.
df.shape
(307511, 99)
df_previous_application_agg.shape
(338857, 73)
df_POS_CASH_agg.shape
(337252, 26)
df_installments_payments_agg.shape
(339587, 10)
df_credit_card_balance_agg.shape
(103558, 51)
df = df.merge(df_previous_application_agg, on='SK_ID_CURR', how='left')
df = df.merge(df_POS_CASH_agg, on='SK_ID_CURR', how='left')
df = df.merge(df_installments_payments_agg, on='SK_ID_CURR', how='left')
df = df.merge(df_credit_card_balance_agg, on='SK_ID_CURR', how='left')
df.shape
(307511, 255)
In total I should get 99 + 72 + 25 + 9 + 50 = 255 columns (each aggregated table contributes its columns minus the SK_ID_CURR key), and that is the case. The row count is unchanged at 307,511, so the merge can be considered correct.
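The shape check can also be automated. The sketch below wraps a left merge in a hypothetical helper, `left_merge_checked`, which relies on the `validate` argument of `merge` and two assertions. It then recomputes the expected column count from the table shapes printed above. The helper name and the toy frames are assumptions made for illustration.

```python
import pandas as pd


def left_merge_checked(left: pd.DataFrame, right: pd.DataFrame, key: str = 'SK_ID_CURR') -> pd.DataFrame:
    """Left merge that fails loudly if keys are duplicated or the shape is unexpected."""
    # validate='one_to_one' raises MergeError when the key is not unique on either side
    merged = left.merge(right, on=key, how='left', validate='one_to_one')
    # A left join on a unique key must keep the number of rows
    assert len(merged) == len(left)
    # The key column is shared, so it is counted only once
    assert merged.shape[1] == left.shape[1] + right.shape[1] - 1
    return merged


# Hypothetical toy frames
base = pd.DataFrame({'SK_ID_CURR': [1, 2, 3], 'TARGET': [0, 1, 0]})
extra = pd.DataFrame({'SK_ID_CURR': [1, 2], 'X_MEAN': [0.1, 0.2], 'X_MAX': [1.0, 2.0]})
result = left_merge_checked(base, extra)

# Expected width of the final table from the shapes printed above:
# df has 99 columns; the four aggregated tables have 73, 26, 10 and 51 columns
aggregated_widths = [73, 26, 10, 51]
expected_columns = 99 + sum(width - 1 for width in aggregated_widths)
```

Here `expected_columns` evaluates to 255, which agrees with the `df.shape` output of (307511, 255).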
Final df table for analysis after the merge¶
results = []
total_rows = len(df)  # Number of rows in the table
# Iterate over every column
for column in df:
    unique_values = df[column].nunique()  # Number of unique values in the column
    data_type = df[column].dtype  # Data type
    null_count = df[column].isnull().sum()  # Number of nulls
    null_percent = (null_count / total_rows) * 100  # Percentage of nulls
    # Decide how to display the unique values
    # With at most 5 unique values, show them regardless of data type
    if unique_values <= 5:
        values_to_display = df[column].unique()
    # For a non-numeric column, show all unique values even when there are more than 5
    elif df[column].dtype == 'object':
        values_to_display = df[column].unique()
    # For numeric columns with more than 5 unique values, show the placeholder below
    else:
        values_to_display = "> 5 unikatowych wartości liczbowych"  # i.e. "> 5 unique numeric values"
    # Append the result to the list
    results.append({
        'Variable': column,
        'Data Type': data_type,
        'Unique Values Count': unique_values,
        'Unique Values': values_to_display,
        'Null Values Count': null_count,
        'Null Values %': null_percent
    })
# Convert the results to a DataFrame
results_df = pd.DataFrame(results)
results_df
| | Variable | Data Type | Unique Values Count | Unique Values | Null Values Count | Null Values % |
|---|---|---|---|---|---|---|
| 0 | SK_ID_CURR | int64 | 307511 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 1 | TARGET | int64 | 2 | [1, 0] | 0 | 0.00 |
| 2 | NAME_CONTRACT_TYPE | object | 2 | [Cash loans, Revolving loans] | 0 | 0.00 |
| 3 | CODE_GENDER | object | 3 | [M, F, XNA] | 0 | 0.00 |
| 4 | FLAG_OWN_CAR | object | 2 | [N, Y] | 0 | 0.00 |
| 5 | FLAG_OWN_REALTY | object | 2 | [Y, N] | 0 | 0.00 |
| 6 | CNT_CHILDREN | int64 | 15 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 7 | AMT_INCOME_TOTAL | float64 | 2548 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 8 | AMT_CREDIT | float64 | 5603 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 9 | AMT_ANNUITY | float64 | 13672 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 10 | AMT_GOODS_PRICE | float64 | 1002 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 11 | NAME_TYPE_SUITE | object | 8 | [Unaccompanied, Family, Spouse, partner, Children, Other_A, Brak Danych, Other_B, Group of people] | 0 | 0.00 |
| 12 | NAME_INCOME_TYPE | object | 8 | [Working, State servant, Commercial associate, Pensioner, Unemployed, Student, Businessman, Maternity leave] | 0 | 0.00 |
| 13 | NAME_EDUCATION_TYPE | object | 5 | [Secondary / secondary special, Higher education, Incomplete higher, Lower secondary, Academic degree] | 0 | 0.00 |
| 14 | NAME_FAMILY_STATUS | object | 6 | [Single / not married, Married, Civil marriage, Widow, Separated, Unknown] | 0 | 0.00 |
| 15 | NAME_HOUSING_TYPE | object | 6 | [House / apartment, Rented apartment, With parents, Municipal apartment, Office apartment, Co-op apartment] | 0 | 0.00 |
| 16 | REGION_POPULATION_RELATIVE | float64 | 81 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 17 | DAYS_BIRTH | int64 | 17460 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 18 | DAYS_EMPLOYED | int64 | 12574 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 19 | DAYS_REGISTRATION | float64 | 15688 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 20 | DAYS_ID_PUBLISH | int64 | 6168 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 21 | FLAG_MOBIL | int64 | 2 | [1, 0] | 0 | 0.00 |
| 22 | FLAG_EMP_PHONE | int64 | 2 | [1, 0] | 0 | 0.00 |
| 23 | FLAG_WORK_PHONE | int64 | 2 | [0, 1] | 0 | 0.00 |
| 24 | FLAG_CONT_MOBILE | int64 | 2 | [1, 0] | 0 | 0.00 |
| 25 | FLAG_PHONE | int64 | 2 | [1, 0] | 0 | 0.00 |
| 26 | FLAG_EMAIL | int64 | 2 | [0, 1] | 0 | 0.00 |
| 27 | OCCUPATION_TYPE | object | 19 | [Laborers, Core staff, Accountants, Managers, Brak Danych, Drivers, Sales staff, Cleaning staff, Cooking staff, Private service staff, Medicine staff, Security staff, High skill tech staff, Waiters/barmen staff, Low-skill Laborers, Realty agents, Secretaries, IT staff, HR staff] | 0 | 0.00 |
| 28 | CNT_FAM_MEMBERS | float64 | 17 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 29 | REGION_RATING_CLIENT | int64 | 3 | [2, 1, 3] | 0 | 0.00 |
| 30 | REGION_RATING_CLIENT_W_CITY | int64 | 3 | [2, 1, 3] | 0 | 0.00 |
| 31 | WEEKDAY_APPR_PROCESS_START | object | 7 | [WEDNESDAY, MONDAY, THURSDAY, SUNDAY, SATURDAY, FRIDAY, TUESDAY] | 0 | 0.00 |
| 32 | HOUR_APPR_PROCESS_START | int64 | 24 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 33 | REG_REGION_NOT_LIVE_REGION | int64 | 2 | [0, 1] | 0 | 0.00 |
| 34 | REG_REGION_NOT_WORK_REGION | int64 | 2 | [0, 1] | 0 | 0.00 |
| 35 | LIVE_REGION_NOT_WORK_REGION | int64 | 2 | [0, 1] | 0 | 0.00 |
| 36 | REG_CITY_NOT_LIVE_CITY | int64 | 2 | [0, 1] | 0 | 0.00 |
| 37 | REG_CITY_NOT_WORK_CITY | int64 | 2 | [0, 1] | 0 | 0.00 |
| 38 | LIVE_CITY_NOT_WORK_CITY | int64 | 2 | [0, 1] | 0 | 0.00 |
| 39 | ORGANIZATION_TYPE | object | 58 | [Business Entity Type 3, School, Government, Religion, Other, XNA, Electricity, Medicine, Business Entity Type 2, Self-employed, Transport: type 2, Construction, Housing, Kindergarten, Trade: type 7, Industry: type 11, Military, Services, Security Ministries, Transport: type 4, Industry: type 1, Emergency, Security, Trade: type 2, University, Transport: type 3, Police, Business Entity Type 1, Postal, Industry: type 4, Agriculture, Restaurant, Culture, Hotel, Industry: type 7, Trade: type 3, Industry: type 3, Bank, Industry: type 9, Insurance, Trade: type 6, Industry: type 2, Transport: type 1, Industry: type 12, Mobile, Trade: type 1, Industry: type 5, Industry: type 10, Legal Services, Advertising, Trade: type 5, Cleaning, Industry: type 13, Trade: type 4, Telecom, Industry: type 8, Realtor, Industry: type 6] | 0 | 0.00 |
| 40 | EXT_SOURCE_2 | float64 | 119831 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 41 | EXT_SOURCE_3 | float64 | 814 | > 5 unikatowych wartości liczbowych | 60965 | 19.83 |
| 42 | OBS_30_CNT_SOCIAL_CIRCLE | float64 | 33 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 43 | DEF_30_CNT_SOCIAL_CIRCLE | float64 | 10 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 44 | OBS_60_CNT_SOCIAL_CIRCLE | float64 | 33 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 45 | DEF_60_CNT_SOCIAL_CIRCLE | float64 | 9 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 46 | DAYS_LAST_PHONE_CHANGE | float64 | 3773 | > 5 unikatowych wartości liczbowych | 0 | 0.00 |
| 47 | FLAG_DOCUMENT_2 | int64 | 2 | [0, 1] | 0 | 0.00 |
| 48 | FLAG_DOCUMENT_3 | int64 | 2 | [1, 0] | 0 | 0.00 |
| 49 | FLAG_DOCUMENT_4 | int64 | 2 | [0, 1] | 0 | 0.00 |
| 50 | FLAG_DOCUMENT_5 | int64 | 2 | [0, 1] | 0 | 0.00 |
| 51 | FLAG_DOCUMENT_6 | int64 | 2 | [0, 1] | 0 | 0.00 |
| 52 | FLAG_DOCUMENT_7 | int64 | 2 | [0, 1] | 0 | 0.00 |
| 53 | FLAG_DOCUMENT_8 | int64 | 2 | [0, 1] | 0 | 0.00 |
| 54 | FLAG_DOCUMENT_9 | int64 | 2 | [0, 1] | 0 | 0.00 |
| 55 | FLAG_DOCUMENT_10 | int64 | 2 | [0, 1] | 0 | 0.00 |
| 56 | FLAG_DOCUMENT_11 | int64 | 2 | [0, 1] | 0 | 0.00 |
| 57 | FLAG_DOCUMENT_12 | int64 | 2 | [0, 1] | 0 | 0.00 |
| 58 | FLAG_DOCUMENT_13 | int64 | 2 | [0, 1] | 0 | 0.00 |
| 59 | FLAG_DOCUMENT_14 | int64 | 2 | [0, 1] | 0 | 0.00 |
| 60 | FLAG_DOCUMENT_15 | int64 | 2 | [0, 1] | 0 | 0.00 |
| 61 | FLAG_DOCUMENT_16 | int64 | 2 | [0, 1] | 0 | 0.00 |
| 62 | FLAG_DOCUMENT_17 | int64 | 2 | [0, 1] | 0 | 0.00 |
| 63 | FLAG_DOCUMENT_18 | int64 | 2 | [0, 1] | 0 | 0.00 |
| 64 | FLAG_DOCUMENT_19 | int64 | 2 | [0, 1] | 0 | 0.00 |
| 65 | FLAG_DOCUMENT_20 | int64 | 2 | [0, 1] | 0 | 0.00 |
| 66 | FLAG_DOCUMENT_21 | int64 | 2 | [0, 1] | 0 | 0.00 |
| 67 | AMT_REQ_CREDIT_BUREAU_YEAR | float64 | 25 | > 5 unikatowych wartości liczbowych | 41519 | 13.50 |
| 68 | BUREAU_DAYS_CREDIT_MEAN | float64 | 64556 | > 5 unikatowych wartości liczbowych | 44020 | 14.31 |
| 69 | BUREAU_DAYS_CREDIT_MAX | float64 | 2923 | > 5 unikatowych wartości liczbowych | 44020 | 14.31 |
| 70 | BUREAU_DAYS_CREDIT_MIN | float64 | 2922 | > 5 unikatowych wartości liczbowych | 44020 | 14.31 |
| 71 | BUREAU_CREDIT_DAY_OVERDUE_MEAN | float64 | 1541 | > 5 unikatowych wartości liczbowych | 44020 | 14.31 |
| 72 | BUREAU_CREDIT_DAY_OVERDUE_SUM | float64 | 879 | > 5 unikatowych wartości liczbowych | 44020 | 14.31 |
| 73 | BUREAU_CREDIT_DAY_OVERDUE_MAX | float64 | 868 | > 5 unikatowych wartości liczbowych | 44020 | 14.31 |
| 74 | BUREAU_DAYS_CREDIT_ENDDATE_MEAN | float64 | 97904 | > 5 unikatowych wartości liczbowych | 44020 | 14.31 |
| 75 | BUREAU_DAYS_CREDIT_ENDDATE_MAX | float64 | 13030 | > 5 unikatowych wartości liczbowych | 44020 | 14.31 |
| 76 | BUREAU_DAYS_CREDIT_ENDDATE_MIN | float64 | 6828 | > 5 unikatowych wartości liczbowych | 44020 | 14.31 |
| 77 | BUREAU_CNT_CREDIT_PROLONG_MEAN | float64 | 111 | > 5 unikatowych wartości liczbowych | 44020 | 14.31 |
| 78 | BUREAU_CNT_CREDIT_PROLONG_SUM | float64 | 10 | > 5 unikatowych wartości liczbowych | 44020 | 14.31 |
| 79 | BUREAU_AMT_CREDIT_SUM_MEAN | float64 | 209605 | > 5 unikatowych wartości liczbowych | 44020 | 14.31 |
| 80 | BUREAU_AMT_CREDIT_SUM_SUM | float64 | 205644 | > 5 unikatowych wartości liczbowych | 44020 | 14.31 |
| 81 | BUREAU_AMT_CREDIT_SUM_DEBT_MEAN | float64 | 169334 | > 5 unikatowych wartości liczbowych | 44020 | 14.31 |
| 82 | BUREAU_AMT_CREDIT_SUM_DEBT_SUM | float64 | 155004 | > 5 unikatowych wartości liczbowych | 44020 | 14.31 |
| 83 | BUREAU_AMT_CREDIT_SUM_LIMIT_SUM | float64 | 37290 | > 5 unikatowych wartości liczbowych | 44020 | 14.31 |
| 84 | BUREAU_AMT_CREDIT_SUM_OVERDUE_MEAN | float64 | 1873 | > 5 unikatowych wartości liczbowych | 44020 | 14.31 |
| 85 | BUREAU_AMT_CREDIT_SUM_OVERDUE_SUM | float64 | 1218 | > 5 unikatowych wartości liczbowych | 44020 | 14.31 |
| 86 | BUREAU_DAYS_CREDIT_UPDATE_MAX | float64 | 2663 | > 5 unikatowych wartości liczbowych | 44020 | 14.31 |
| 87 | BUREAU_DAYS_CREDIT_UPDATE_MIN | float64 | 2969 | > 5 unikatowych wartości liczbowych | 44020 | 14.31 |
| 88 | BUREAU_MOST_FREQ_CREDIT_ACTIVE | object | 3 | [Closed, nan, Active, Sold] | 44020 | 14.31 |
| 89 | BUREAU_MOST_FREQ_CREDIT_CURRENCY | object | 3 | [currency 1, nan, currency 2, currency 3] | 44020 | 14.31 |
| 90 | BUREAU_MOST_FREQ_CREDIT_TYPE | object | 11 | [Consumer credit, nan, Credit card, Car loan, Microloan, Mortgage, Another type of loan, Loan for business development, Loan for working capital replenishment, Unknown type of loan, Real estate loan, Loan for the purchase of equipment] | 44020 | 14.31 |
| 91 | BUREAU_BALANCE_STATUS_0 | float64 | 451 | > 5 unikatowych wartości liczbowych | 44020 | 14.31 |
| 92 | BUREAU_BALANCE_STATUS_1 | float64 | 101 | > 5 unikatowych wartości liczbowych | 44020 | 14.31 |
| 93 | BUREAU_BALANCE_STATUS_2 | float64 | 32 | > 5 unikatowych wartości liczbowych | 44020 | 14.31 |
| 94 | BUREAU_BALANCE_STATUS_3 | float64 | 24 | > 5 unikatowych wartości liczbowych | 44020 | 14.31 |
| 95 | BUREAU_BALANCE_STATUS_4 | float64 | 19 | > 5 unikatowych wartości liczbowych | 44020 | 14.31 |
| 96 | BUREAU_BALANCE_STATUS_5 | float64 | 132 | > 5 unikatowych wartości liczbowych | 44020 | 14.31 |
| 97 | BUREAU_BALANCE_STATUS_C | float64 | 764 | > 5 unikatowych wartości liczbowych | 44020 | 14.31 |
| 98 | BUREAU_BALANCE_STATUS_X | float64 | 536 | > 5 unikatowych wartości liczbowych | 44020 | 14.31 |
| 99 | PREVIOUS_APPLICATION_AMT_ANNUITY_MEAN | float64 | 268978 | > 5 unikatowych wartości liczbowych | 16454 | 5.35 |
| 100 | PREVIOUS_APPLICATION_AMT_ANNUITY_MAX | float64 | 145950 | > 5 unikatowych wartości liczbowych | 16454 | 5.35 |
| 101 | PREVIOUS_APPLICATION_AMT_ANNUITY_MIN | float64 | 145162 | > 5 unikatowych wartości liczbowych | 16454 | 5.35 |
| 102 | PREVIOUS_APPLICATION_AMT_ANNUITY_SUM | float64 | 268976 | > 5 unikatowych wartości liczbowych | 16454 | 5.35 |
| 103 | PREVIOUS_APPLICATION_AMT_APPLICATION_MEAN | float64 | 191770 | > 5 unikatowych wartości liczbowych | 16454 | 5.35 |
| 104 | PREVIOUS_APPLICATION_AMT_APPLICATION_MAX | float64 | 48662 | > 5 unikatowych wartości liczbowych | 16454 | 5.35 |
| 105 | PREVIOUS_APPLICATION_AMT_APPLICATION_MIN | float64 | 36289 | > 5 unikatowych wartości liczbowych | 16454 | 5.35 |
| 106 | PREVIOUS_APPLICATION_AMT_APPLICATION_SUM | float64 | 175562 | > 5 unikatowych wartości liczbowych | 16454 | 5.35 |
| 107 | PREVIOUS_APPLICATION_AMT_CREDIT_MEAN | float64 | 210634 | > 5 unikatowych wartości liczbowych | 16454 | 5.35 |
| 108 | PREVIOUS_APPLICATION_AMT_CREDIT_MAX | float64 | 58510 | > 5 unikatowych wartości liczbowych | 16454 | 5.35 |
| 109 | PREVIOUS_APPLICATION_AMT_CREDIT_MIN | float64 | 38544 | > 5 unikatowych wartości liczbowych | 16454 | 5.35 |
| 110 | PREVIOUS_APPLICATION_AMT_CREDIT_SUM | float64 | 194245 | > 5 unikatowych wartości liczbowych | 16454 | 5.35 |
| 111 | PREVIOUS_APPLICATION_AMT_GOODS_PRICE_MEAN | float64 | 185849 | > 5 unikatowych wartości liczbowych | 16454 | 5.35 |
| 112 | PREVIOUS_APPLICATION_AMT_GOODS_PRICE_MAX | float64 | 48658 | > 5 unikatowych wartości liczbowych | 16454 | 5.35 |
| 113 | PREVIOUS_APPLICATION_AMT_GOODS_PRICE_MIN | float64 | 47005 | > 5 unikatowych wartości liczbowych | 16454 | 5.35 |
| 114 | PREVIOUS_APPLICATION_AMT_GOODS_PRICE_SUM | float64 | 175573 | > 5 unikatowych wartości liczbowych | 16454 | 5.35 |
| 115 | PREVIOUS_APPLICATION_HOUR_APPR_PROCESS_START_MEAN | float64 | 2558 | > 5 unikatowych wartości liczbowych | 16454 | 5.35 |
| 116 | PREVIOUS_APPLICATION_HOUR_APPR_PROCESS_START_MIN | float64 | 24 | > 5 unikatowych wartości liczbowych | 16454 | 5.35 |
| 117 | PREVIOUS_APPLICATION_HOUR_APPR_PROCESS_START_MAX | float64 | 24 | > 5 unikatowych wartości liczbowych | 16454 | 5.35 |
| 118 | PREVIOUS_APPLICATION_DAYS_DECISION_MEAN | float64 | 60029 | > 5 unikatowych wartości liczbowych | 16454 | 5.35 |
| 119 | PREVIOUS_APPLICATION_DAYS_DECISION_MAX | float64 | 2922 | > 5 unikatowych wartości liczbowych | 16454 | 5.35 |
| 120 | PREVIOUS_APPLICATION_DAYS_DECISION_MIN | float64 | 2921 | > 5 unikatowych wartości liczbowych | 16454 | 5.35 |
| 121 | PREVIOUS_APPLICATION_DAYS_DECISION_SUM | float64 | 20116 | > 5 unikatowych wartości liczbowych | 16454 | 5.35 |
| 122 | PREVIOUS_APPLICATION_RATIO_APP_TO_GET_MEAN | float64 | 250397 | > 5 unikatowych wartości liczbowych | 16454 | 5.35 |
| 123 | PREVIOUS_APPLICATION_RATIO_APP_TO_GET_MAX | float64 | 141287 | > 5 unikatowych wartości liczbowych | 16454 | 5.35 |
| 124 | PREVIOUS_APPLICATION_RATIO_APP_TO_GET_MIN | float64 | 72935 | > 5 unikatowych wartości liczbowych | 16454 | 5.35 |
| 125 | PREVIOUS_APPLICATION_PREV_APPS_COUNT | float64 | 67 | > 5 unikatowych wartości liczbowych | 16454 | 5.35 |
| 126 | PREVIOUS_APPLICATION_NAME_CONTRACT_TYPE_CASH LOANS | float64 | 59 | > 5 unikatowych wartości liczbowych | 16454 | 5.35 |
| 127 | PREVIOUS_APPLICATION_NAME_CONTRACT_TYPE_CONSUMER LOANS | float64 | 36 | > 5 unikatowych wartości liczbowych | 16454 | 5.35 |
| 128 | PREVIOUS_APPLICATION_NAME_CONTRACT_TYPE_REVOLVING LOANS | float64 | 27 | > 5 unikatowych wartości liczbowych | 16454 | 5.35 |
| 129 | PREVIOUS_APPLICATION_NAME_CONTRACT_TYPE_XNA | float64 | 4 | [0.0, nan, 2.0, 1.0, 3.0] | 16454 | 5.35 |
| 130 | PREVIOUS_APPLICATION_WEEKDAY_APPR_PROCESS_START_FRIDAY | float64 | 21 | > 5 unikatowych wartości liczbowych | 16454 | 5.35 |
| 131 | PREVIOUS_APPLICATION_WEEKDAY_APPR_PROCESS_START_MONDAY | float64 | 24 | > 5 unikatowych wartości liczbowych | 16454 | 5.35 |
| 132 | PREVIOUS_APPLICATION_WEEKDAY_APPR_PROCESS_START_SATURDAY | float64 | 20 | > 5 unikatowych wartości liczbowych | 16454 | 5.35 |
| 133 | PREVIOUS_APPLICATION_WEEKDAY_APPR_PROCESS_START_SUNDAY | float64 | 21 | > 5 unikatowych wartości liczbowych | 16454 | 5.35 |
| 134 | PREVIOUS_APPLICATION_WEEKDAY_APPR_PROCESS_START_THURSDAY | float64 | 23 | > 5 unikatowych wartości liczbowych | 16454 | 5.35 |
| 135 | PREVIOUS_APPLICATION_WEEKDAY_APPR_PROCESS_START_TUESDAY | float64 | 21 | > 5 unikatowych wartości liczbowych | 16454 | 5.35 |
| 136 | PREVIOUS_APPLICATION_WEEKDAY_APPR_PROCESS_START_WEDNESDAY | float64 | 21 | > 5 unikatowych wartości liczbowych | 16454 | 5.35 |
| 137 | PREVIOUS_APPLICATION_NAME_CONTRACT_STATUS_APPROVED | float64 | 26 | > 5 unikatowych wartości liczbowych | 16454 | 5.35 |
| 138 | PREVIOUS_APPLICATION_NAME_CONTRACT_STATUS_CANCELED | float64 | 39 | > 5 unikatowych wartości liczbowych | 16454 | 5.35 |
| 139 | PREVIOUS_APPLICATION_NAME_CONTRACT_STATUS_REFUSED | float64 | 46 | > 5 unikatowych wartości liczbowych | 16454 | 5.35 |
| 140 | PREVIOUS_APPLICATION_NAME_CONTRACT_STATUS_UNUSED OFFER | float64 | 11 | > 5 unikatowych wartości liczbowych | 16454 | 5.35 |
| 141 | PREVIOUS_APPLICATION_NAME_CLIENT_TYPE_NEW | float64 | 18 | > 5 unikatowych wartości liczbowych | 16454 | 5.35 |
| 142 | PREVIOUS_APPLICATION_NAME_CLIENT_TYPE_REFRESHED | float64 | 24 | > 5 unikatowych wartości liczbowych | 16454 | 5.35 |
| 143 | PREVIOUS_APPLICATION_NAME_CLIENT_TYPE_REPEATER | float64 | 64 | > 5 unikatowych wartości liczbowych | 16454 | 5.35 |
| 144 | PREVIOUS_APPLICATION_NAME_CLIENT_TYPE_XNA | float64 | 10 | > 5 unikatowych wartości liczbowych | 16454 | 5.35 |
| 145 | PREVIOUS_APPLICATION_CODE_REJECT_REASON_CLIENT | float64 | 11 | > 5 unikatowych wartości liczbowych | 16454 | 5.35 |
| 146 | PREVIOUS_APPLICATION_CODE_REJECT_REASON_HC | float64 | 37 | > 5 unikatowych wartości liczbowych | 16454 | 5.35 |
| 147 | PREVIOUS_APPLICATION_CODE_REJECT_REASON_LIMIT | float64 | 22 | > 5 unikatowych wartości liczbowych | 16454 | 5.35 |
| 148 | PREVIOUS_APPLICATION_CODE_REJECT_REASON_SCO | float64 | 21 | > 5 unikatowych wartości liczbowych | 16454 | 5.35 |
| 149 | PREVIOUS_APPLICATION_CODE_REJECT_REASON_SCOFR | float64 | 18 | > 5 unikatowych wartości liczbowych | 16454 | 5.35 |
| 150 | PREVIOUS_APPLICATION_CODE_REJECT_REASON_SYSTEM | float64 | 11 | > 5 unikatowych wartości liczbowych | 16454 | 5.35 |
| 151 | PREVIOUS_APPLICATION_CODE_REJECT_REASON_VERIF | float64 | 8 | > 5 unikatowych wartości liczbowych | 16454 | 5.35 |
| 152 | PREVIOUS_APPLICATION_CODE_REJECT_REASON_XAP | float64 | 47 | > 5 unikatowych wartości liczbowych | 16454 | 5.35 |
| 153 | PREVIOUS_APPLICATION_CODE_REJECT_REASON_XNA | float64 | 12 | > 5 unikatowych wartości liczbowych | 16454 | 5.35 |
| 154 | PREVIOUS_APPLICATION_NAME_PRODUCT_TYPE_XNA | float64 | 48 | > 5 unikatowych wartości liczbowych | 16454 | 5.35 |
| 155 | PREVIOUS_APPLICATION_NAME_PRODUCT_TYPE_WALK-IN | float64 | 33 | > 5 unikatowych wartości liczbowych | 16454 | 5.35 |
| 156 | PREVIOUS_APPLICATION_NAME_PRODUCT_TYPE_X-SELL | float64 | 34 | > 5 unikatowych wartości liczbowych | 16454 | 5.35 |
| 157 | PREVIOUS_APPLICATION_NAME_YIELD_GROUP_XNA | float64 | 47 | > 5 unikatowych wartości liczbowych | 16454 | 5.35 |
| 158 | PREVIOUS_APPLICATION_NAME_YIELD_GROUP_HIGH | float64 | 31 | > 5 unikatowych wartości liczbowych | 16454 | 5.35 |
| 159 | PREVIOUS_APPLICATION_NAME_YIELD_GROUP_LOW_ACTION | float64 | 24 | > 5 unikatowych wartości liczbowych | 16454 | 5.35 |
| 160 | PREVIOUS_APPLICATION_NAME_YIELD_GROUP_LOW_NORMAL | float64 | 26 | > 5 unikatowych wartości liczbowych | 16454 | 5.35 |
| 161 | PREVIOUS_APPLICATION_NAME_YIELD_GROUP_MIDDLE | float64 | 28 | > 5 unikatowych wartości liczbowych | 16454 | 5.35 |
| 162 | PREVIOUS_APPLICATION_MOST_FREQ_PREVIOUS_APPLICATION_NAME_CASH_LOAN_PURPOSE_X | object | 25 | [XAP, XNA, nan, Other, Repairs, Buying a used car, Buying a holiday home / land, Car repairs, Building a house or an annex, Buying a new car, Urgent needs, Furniture, Education, Medicine, Everyday expenses, Buying a home, Wedding / gift / holiday, Business development, Payments on other loans, Purchase of electronic equipment, Journey, Gasification / water supply, Hobby, Buying a garage, Money for a third person, Refusal to name the goal] | 16454 | 5.35 |
| 163 | PREVIOUS_APPLICATION_MOST_FREQ_PREVIOUS_APPLICATION_NAME_PAYMENT_TYPE_X | object | 4 | [XNA, Cash through the bank, nan, Non-cash from your account, Cashless from the account of the employer] | 16454 | 5.35 |
| 164 | PREVIOUS_APPLICATION_MOST_FREQ_PREVIOUS_APPLICATION_NAME_GOODS_CATEGORY_X | object | 26 | [Vehicles, Consumer Electronics, Mobile, XNA, Audio/Video, Furniture, Computers, Construction Materials, nan, Clothing and Accessories, Homewares, Photo / Cinema Equipment, Gardening, Office Appliances, Auto Accessories, Medicine, Jewelry, Weapon, Fitness, Medical Supplies, Tourism, Sport and Leisure, Other, Direct Sales, Education, Insurance, Additional Service] | 16454 | 5.35 |
| 165 | PREVIOUS_APPLICATION_MOST_FREQ_PREVIOUS_APPLICATION_CHANNEL_TYPE_X | object | 8 | [Stone, Country-wide, Regional / Local, Credit and cash offices, AP+ (Cash loan), nan, Contact center, Channel of corporate sales, Car dealer] | 16454 | 5.35 |
| 166 | PREVIOUS_APPLICATION_MOST_FREQ_PREVIOUS_APPLICATION_NAME_CASH_LOAN_PURPOSE_Y | object | 25 | [XAP, XNA, nan, Other, Repairs, Buying a used car, Buying a holiday home / land, Car repairs, Building a house or an annex, Buying a new car, Urgent needs, Furniture, Education, Medicine, Everyday expenses, Buying a home, Wedding / gift / holiday, Business development, Payments on other loans, Purchase of electronic equipment, Journey, Gasification / water supply, Hobby, Buying a garage, Money for a third person, Refusal to name the goal] | 16454 | 5.35 |
| 167 | PREVIOUS_APPLICATION_MOST_FREQ_PREVIOUS_APPLICATION_NAME_PAYMENT_TYPE_Y | object | 4 | [XNA, Cash through the bank, nan, Non-cash from your account, Cashless from the account of the employer] | 16454 | 5.35 |
| 168 | PREVIOUS_APPLICATION_MOST_FREQ_PREVIOUS_APPLICATION_NAME_GOODS_CATEGORY_Y | object | 26 | [Vehicles, Consumer Electronics, Mobile, XNA, Audio/Video, Furniture, Computers, Construction Materials, nan, Clothing and Accessories, Homewares, Photo / Cinema Equipment, Gardening, Office Appliances, Auto Accessories, Medicine, Jewelry, Weapon, Fitness, Medical Supplies, Tourism, Sport and Leisure, Other, Direct Sales, Education, Insurance, Additional Service] | 16454 | 5.35 |
| 169 | PREVIOUS_APPLICATION_MOST_FREQ_PREVIOUS_APPLICATION_CHANNEL_TYPE_Y | object | 8 | [Stone, Country-wide, Regional / Local, Credit and cash offices, AP+ (Cash loan), nan, Contact center, Channel of corporate sales, Car dealer] | 16454 | 5.35 |
| 170 | PREVIOUS_APPLICATION_MOST_FREQ_PREVIOUS_APPLICATION_PRODUCT_COMBINATION | object | 17 | [POS other with interest, Cash X-Sell: low, POS mobile without interest, Cash, Cash X-Sell: middle, POS household with interest, POS industry without interest, Card X-Sell, Cash X-Sell: high, POS household without interest, POS industry with interest, Card Street, nan, POS mobile with interest, Cash Street: high, Cash Street: low, Cash Street: middle, POS others without interest] | 16454 | 5.35 |
| 171 | POS_CASH_MONTHS_BALANCE_MIN | float64 | 96 | > 5 unikatowych wartości liczbowych | 18067 | 5.88 |
| 172 | POS_CASH_MONTHS_BALANCE_MAX | float64 | 96 | > 5 unikatowych wartości liczbowych | 18067 | 5.88 |
| 173 | POS_CASH_MONTHS_BALANCE_MEAN | float64 | 63032 | > 5 unikatowych wartości liczbowych | 18067 | 5.88 |
| 174 | POS_CASH_CNT_INSTALMENT_MEAN | float64 | 41222 | > 5 unikatowych wartości liczbowych | 18067 | 5.88 |
| 175 | POS_CASH_CNT_INSTALMENT_SUM | float64 | 4194 | > 5 unikatowych wartości liczbowych | 18067 | 5.88 |
| 176 | POS_CASH_CNT_INSTALMENT_MAX | float64 | 61 | > 5 unikatowych wartości liczbowych | 18067 | 5.88 |
| 177 | POS_CASH_CNT_INSTALMENT_MIN | float64 | 57 | > 5 unikatowych wartości liczbowych | 18067 | 5.88 |
| 178 | POS_CASH_CNT_INSTALMENT_FUTURE_MEAN | float64 | 39841 | > 5 unikatowych wartości liczbowych | 18067 | 5.88 |
| 179 | POS_CASH_CNT_INSTALMENT_FUTURE_SUM | float64 | 2892 | > 5 unikatowych wartości liczbowych | 18067 | 5.88 |
| 180 | POS_CASH_CNT_INSTALMENT_FUTURE_MAX | float64 | 62 | > 5 unikatowych wartości liczbowych | 18067 | 5.88 |
| 181 | POS_CASH_CNT_INSTALMENT_FUTURE_MIN | float64 | 61 | > 5 unikatowych wartości liczbowych | 18067 | 5.88 |
| 182 | POS_CASH_SK_DPD_MAX | float64 | 1919 | > 5 unikatowych wartości liczbowych | 18067 | 5.88 |
| 183 | POS_CASH_SK_DPD_MEAN | float64 | 10768 | > 5 unikatowych wartości liczbowych | 18067 | 5.88 |
| 184 | POS_CASH_SK_DPD_DEF_MAX | float64 | 200 | > 5 unikatowych wartości liczbowych | 18067 | 5.88 |
| 185 | POS_CASH_SK_DPD_DEF_MEAN | float64 | 4471 | > 5 unikatowych wartości liczbowych | 18067 | 5.88 |
| 186 | POS_CASH_NAME_CONTRACT_STATUS_ACTIVE | float64 | 212 | > 5 unikatowych wartości liczbowych | 18067 | 5.88 |
| 187 | POS_CASH_NAME_CONTRACT_STATUS_AMORTIZED DEBT | float64 | 9 | > 5 unikatowych wartości liczbowych | 18067 | 5.88 |
| 188 | POS_CASH_NAME_CONTRACT_STATUS_APPROVED | float64 | 12 | > 5 unikatowych wartości liczbowych | 18067 | 5.88 |
| 189 | POS_CASH_NAME_CONTRACT_STATUS_CANCELED | float64 | 2 | [0.0, nan, 1.0] | 18067 | 5.88 |
| 190 | POS_CASH_NAME_CONTRACT_STATUS_COMPLETED | float64 | 47 | > 5 unikatowych wartości liczbowych | 18067 | 5.88 |
| 191 | POS_CASH_NAME_CONTRACT_STATUS_DEMAND | float64 | 53 | > 5 unikatowych wartości liczbowych | 18067 | 5.88 |
| 192 | POS_CASH_NAME_CONTRACT_STATUS_RETURNED TO THE STORE | float64 | 5 | [0.0, 1.0, nan, 2.0, 3.0, 4.0] | 18067 | 5.88 |
| 193 | POS_CASH_NAME_CONTRACT_STATUS_SIGNED | float64 | 32 | > 5 unikatowych wartości liczbowych | 18067 | 5.88 |
| 194 | POS_CASH_NAME_CONTRACT_STATUS_XNA | float64 | 2 | [0.0, nan, 1.0] | 18067 | 5.88 |
| 195 | POS_CASH_APP_COUNT | float64 | 25 | > 5 unikatowych wartości liczbowych | 18067 | 5.88 |
| 196 | INSTALLMENTS_PAYMENTS_IS_DELAYED_SUM | float64 | 102 | > 5 unikatowych wartości liczbowych | 15868 | 5.16 |
| 197 | INSTALLMENTS_PAYMENTS_IS_DELAYED_MEAN | float64 | 4972 | > 5 unikatowych wartości liczbowych | 15868 | 5.16 |
| 198 | INSTALLMENTS_PAYMENTS_NUM_INSTALMENT_VERSION_NUNIQUE | float64 | 49 | > 5 unikatowych wartości liczbowych | 15868 | 5.16 |
| 199 | INSTALLMENTS_PAYMENTS_PAYMENT_CHANGE_MEAN | float64 | 273533 | > 5 unikatowych wartości liczbowych | 15868 | 5.16 |
| 200 | INSTALLMENTS_PAYMENTS_PAYMENT_CHANGE_MAX | float64 | 263111 | > 5 unikatowych wartości liczbowych | 15868 | 5.16 |
| 201 | INSTALLMENTS_PAYMENTS_PAYMENT_UNDERPAYMENT_MEAN | float64 | 138000 | > 5 unikatowych wartości liczbowych | 15868 | 5.16 |
| 202 | INSTALLMENTS_PAYMENTS_PAYMENT_UNDERPAYMENT_SUM | float64 | 128308 | > 5 unikatowych wartości liczbowych | 15868 | 5.16 |
| 203 | INSTALLMENTS_PAYMENTS_PAYMENT_UNDERPAYMENT_MAX | float64 | 104955 | > 5 unikatowych wartości liczbowych | 15868 | 5.16 |
| 204 | INSTALLMENTS_PAYMENTS_PERCENTAGE_DELAYED | float64 | 5654 | > 5 unikatowych wartości liczbowych | 15868 | 5.16 |
| 205 | CREDIT_CARD_BALANCE_MONTHS_BALANCE_MIN | float64 | 96 | > 5 unikatowych wartości liczbowych | 220606 | 71.74 |
| 206 | CREDIT_CARD_BALANCE_MONTHS_BALANCE_MAX | float64 | 5 | [nan, -1.0, -2.0, -4.0, -3.0, -5.0] | 220606 | 71.74 |
| 207 | CREDIT_CARD_BALANCE_MONTHS_BALANCE_MEAN | float64 | 459 | > 5 unikatowych wartości liczbowych | 220606 | 71.74 |
| 208 | CREDIT_CARD_BALANCE_AMT_BALANCE_SUM | float64 | 58811 | > 5 unikatowych wartości liczbowych | 220606 | 71.74 |
| 209 | CREDIT_CARD_BALANCE_AMT_BALANCE_MEAN | float64 | 59467 | > 5 unikatowych wartości liczbowych | 220606 | 71.74 |
| 210 | CREDIT_CARD_BALANCE_AMT_BALANCE_MAX | float64 | 56722 | > 5 unikatowych wartości liczbowych | 220606 | 71.74 |
| 211 | CREDIT_CARD_BALANCE_AMT_BALANCE_MIN | float64 | 11608 | > 5 unikatowych wartości liczbowych | 220606 | 71.74 |
| 212 | CREDIT_CARD_BALANCE_AMT_BALANCE_STD | float64 | 59348 | > 5 unikatowych wartości liczbowych | 220606 | 71.74 |
| 213 | CREDIT_CARD_BALANCE_AMT_CREDIT_LIMIT_ACTUAL_SUM | float64 | 2440 | > 5 unikatowych wartości liczbowych | 220606 | 71.74 |
| 214 | CREDIT_CARD_BALANCE_AMT_CREDIT_LIMIT_ACTUAL_MEAN | float64 | 11313 | > 5 unikatowych wartości liczbowych | 220606 | 71.74 |
| 215 | CREDIT_CARD_BALANCE_AMT_CREDIT_LIMIT_ACTUAL_MAX | float64 | 54 | > 5 unikatowych wartości liczbowych | 220606 | 71.74 |
| 216 | CREDIT_CARD_BALANCE_AMT_CREDIT_LIMIT_ACTUAL_MIN | float64 | 161 | > 5 unikatowych wartości liczbowych | 220606 | 71.74 |
| 217 | CREDIT_CARD_BALANCE_AMT_CREDIT_LIMIT_ACTUAL_STD | float64 | 21993 | > 5 unikatowych wartości liczbowych | 220606 | 71.74 |
| 218 | CREDIT_CARD_BALANCE_AMT_DRAWINGS_CURRENT_SUM | float64 | 40377 | > 5 unikatowych wartości liczbowych | 220606 | 71.74 |
| 219 | CREDIT_CARD_BALANCE_AMT_DRAWINGS_CURRENT_MEAN | float64 | 49079 | > 5 unikatowych wartości liczbowych | 220606 | 71.74 |
| 220 | CREDIT_CARD_BALANCE_AMT_DRAWINGS_CURRENT_MAX | float64 | 24117 | > 5 unikatowych wartości liczbowych | 220606 | 71.74 |
| 221 | CREDIT_CARD_BALANCE_AMT_DRAWINGS_CURRENT_MIN | float64 | 2066 | > 5 unikatowych wartości liczbowych | 220606 | 71.74 |
| 222 | CREDIT_CARD_BALANCE_AMT_DRAWINGS_CURRENT_STD | float64 | 55773 | > 5 unikatowych wartości liczbowych | 220606 | 71.74 |
| 223 | CREDIT_CARD_BALANCE_AMT_PAYMENT_TOTAL_CURRENT_SUM | float64 | 52950 | > 5 unikatowych wartości liczbowych | 220606 | 71.74 |
| 224 | CREDIT_CARD_BALANCE_AMT_PAYMENT_TOTAL_CURRENT_MEAN | float64 | 57833 | > 5 unikatowych wartości liczbowych | 220606 | 71.74 |
| 225 | CREDIT_CARD_BALANCE_AMT_PAYMENT_TOTAL_CURRENT_MAX | float64 | 30350 | > 5 unikatowych wartości liczbowych | 220606 | 71.74 |
| 226 | CREDIT_CARD_BALANCE_AMT_PAYMENT_TOTAL_CURRENT_MIN | float64 | 1507 | > 5 unikatowych wartości liczbowych | 220606 | 71.74 |
| 227 | CREDIT_CARD_BALANCE_AMT_TOTAL_RECEIVABLE_SUM | float64 | 58868 | > 5 unikatowych wartości liczbowych | 220606 | 71.74 |
| 228 | CREDIT_CARD_BALANCE_AMT_TOTAL_RECEIVABLE_MEAN | float64 | 59521 | > 5 unikatowych wartości liczbowych | 220606 | 71.74 |
| 229 | CREDIT_CARD_BALANCE_AMT_TOTAL_RECEIVABLE_MAX | float64 | 56422 | > 5 unikatowych wartości liczbowych | 220606 | 71.74 |
| 230 | CREDIT_CARD_BALANCE_AMT_TOTAL_RECEIVABLE_MIN | float64 | 19918 | > 5 unikatowych wartości liczbowych | 220606 | 71.74 |
| 231 | CREDIT_CARD_BALANCE_AMT_TOTAL_RECEIVABLE_STD | float64 | 59437 | > 5 unikatowych wartości liczbowych | 220606 | 71.74 |
| 232 | CREDIT_CARD_BALANCE_CNT_DRAWINGS_CURRENT_SUM | float64 | 548 | > 5 unikatowych wartości liczbowych | 220606 | 71.74 |
| 233 | CREDIT_CARD_BALANCE_CNT_DRAWINGS_CURRENT_MEAN | float64 | 6584 | > 5 unikatowych wartości liczbowych | 220606 | 71.74 |
| 234 | CREDIT_CARD_BALANCE_CNT_DRAWINGS_CURRENT_MAX | float64 | 119 | > 5 unikatowych wartości liczbowych | 220606 | 71.74 |
| 235 | CREDIT_CARD_BALANCE_CNT_DRAWINGS_CURRENT_MIN | float64 | 43 | > 5 unikatowych wartości liczbowych | 220606 | 71.74 |
| 236 | CREDIT_CARD_BALANCE_CNT_INSTALMENT_MATURE_CUM_SUM | float64 | 4925 | > 5 unikatowych wartości liczbowych | 220606 | 71.74 |
| 237 | CREDIT_CARD_BALANCE_CNT_INSTALMENT_MATURE_CUM_MEAN | float64 | 13923 | > 5 unikatowych wartości liczbowych | 220606 | 71.74 |
| 238 | CREDIT_CARD_BALANCE_CNT_INSTALMENT_MATURE_CUM_MAX | float64 | 120 | > 5 unikatowych wartości liczbowych | 220606 | 71.74 |
| 239 | CREDIT_CARD_BALANCE_CNT_INSTALMENT_MATURE_CUM_MIN | float64 | 28 | > 5 unikatowych wartości liczbowych | 220606 | 71.74 |
| 240 | CREDIT_CARD_BALANCE_SK_DPD_MAX | float64 | 409 | > 5 unikatowych wartości liczbowych | 220606 | 71.74 |
| 241 | CREDIT_CARD_BALANCE_SK_DPD_MEAN | float64 | 3606 | > 5 unikatowych wartości liczbowych | 220606 | 71.74 |
| 242 | CREDIT_CARD_BALANCE_SK_DPD_SUM | float64 | 1464 | > 5 unikatowych wartości liczbowych | 220606 | 71.74 |
| 243 | CREDIT_CARD_BALANCE_SK_DPD_DEF_MAX | float64 | 52 | > 5 unikatowych wartości liczbowych | 220606 | 71.74 |
| 244 | CREDIT_CARD_BALANCE_SK_DPD_DEF_MEAN | float64 | 1548 | > 5 unikatowych wartości liczbowych | 220606 | 71.74 |
| 245 | CREDIT_CARD_BALANCE_SK_DPD_DEF_SUM | float64 | 217 | > 5 unikatowych wartości liczbowych | 220606 | 71.74 |
| 246 | CREDIT_CARD_BALANCE_CREDIT_USE_RATE_MEAN | float64 | 59391 | > 5 unikatowych wartości liczbowych | 220606 | 71.74 |
| 247 | CREDIT_CARD_BALANCE_CREDIT_USE_RATE_MAX | float64 | 56150 | > 5 unikatowych wartości liczbowych | 220606 | 71.74 |
| 248 | CREDIT_CARD_BALANCE_STATUS_Active | float64 | 103 | > 5 unikatowych wartości liczbowych | 220606 | 71.74 |
| 249 | CREDIT_CARD_BALANCE_STATUS_Approved | float64 | 2 | [nan, 0.0, 1.0] | 220606 | 71.74 |
| 250 | CREDIT_CARD_BALANCE_STATUS_Completed | float64 | 40 | > 5 unikatowych wartości liczbowych | 220606 | 71.74 |
| 251 | CREDIT_CARD_BALANCE_STATUS_Demand | float64 | 14 | > 5 unikatowych wartości liczbowych | 220606 | 71.74 |
| 252 | CREDIT_CARD_BALANCE_STATUS_Refused | float64 | 2 | [nan, 0.0, 1.0] | 220606 | 71.74 |
| 253 | CREDIT_CARD_BALANCE_STATUS_Sent proposal | float64 | 2 | [nan, 0.0, 1.0] | 220606 | 71.74 |
| 254 | CREDIT_CARD_BALANCE_STATUS_Signed | float64 | 43 | > 5 unikatowych wartości liczbowych | 220606 | 71.74 |
* Every column added from a given auxiliary table has the same number of missing values. This is because that table simply contains no records for some percentage of the current loan application IDs.
* For gaps of this kind, the best option is to keep the fact that the data is missing as information in its own right, but this is directly possible only for categorical variables.
* Quantitative variables must either be imputed or discretized, with a separate category assigned to the missing values. Missing values make up at least 5% of the rows for each of the auxiliary tables, so such a category could be created for every one of these variables.
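The discretization option described above can be sketched as follows. This is only an illustration under assumptions: the helper name `bin_with_missing`, the use of quantile bins and the number of bins are mine, not taken from this notebook; the label 'Brak danych' matches the one used for the categorical columns.

```python
import numpy as np
import pandas as pd

def bin_with_missing(series, n_bins=4, missing_label="Brak danych"):
    """Discretize a numeric series into quantile bins and keep NaN as its own category.

    Hypothetical helper for illustration; bin count and labels are assumptions.
    """
    # Quantile-based bins; duplicates="drop" merges bins that share an edge
    binned = pd.qcut(series, q=n_bins, duplicates="drop")
    # Switch to plain objects so the missing-data label can sit next to the intervals
    binned = binned.astype(object).where(series.notna(), missing_label)
    return binned.astype(str)

# Toy data standing in for one of the aggregated numeric columns
toy = pd.Series([1.0, 2.0, np.nan, 4.0, 5.0, np.nan, 7.0, 8.0])
result = bin_with_missing(toy, n_bins=2)
print(result.value_counts())
```

The missing values end up as a regular category, so a model or a WoE table can treat "no record in the auxiliary table" as a signal instead of discarding it.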
Handling missing data in categorical columns¶
categorical_columns = df.select_dtypes(exclude=[np.number]).columns
# Replace missing values in categorical variables with an explicit 'Brak danych' (no data) label
df[categorical_columns] = df[categorical_columns].fillna('Brak danych')
Data preparation, part II¶
df.shape
(307511, 255)
# df.to_csv('df.csv', index=False) # Save the whole dataset to CSV so that the calculations can be done in Statistica
Split into training and test sets¶
I split the data into training and test sets now, before variable selection and discretization, so as not to overfit the model: decisions made with the test set in view would leak information from it into the model. All these steps should be carried out on the training set only, and the test set should then be aligned with the training set, e.g. in terms of the variables selected.
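A minimal sketch of the "fit on the training set, apply to the test set" rule, assuming a stratified 70/30 split and quantile bins. The toy data, the split ratio and the number of bins are assumptions for illustration; the actual split in this notebook is done in Statistica.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the merged table (values are made up)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "AMT_CREDIT": rng.uniform(50_000, 1_000_000, size=1000),
    "TARGET": rng.choice([0, 1], size=1000, p=[0.92, 0.08]),
})

# Stratified split: draw 70% of each TARGET class for training, the rest is the test set
train = toy.groupby("TARGET").sample(frac=0.7, random_state=0)
test = toy.drop(train.index)

# Learn the bin edges from the training set only
_, edges = pd.qcut(train["AMT_CREDIT"], q=4, retbins=True)
# Open the outer edges so test values outside the training range still fall into a bin
edges[0], edges[-1] = -np.inf, np.inf

# Apply the same edges to both sets
train_binned = pd.cut(train["AMT_CREDIT"], bins=edges)
test_binned = pd.cut(test["AMT_CREDIT"], bins=edges)
print(test_binned.value_counts())
```

Because the edges come from the training set alone, the test set plays no part in defining the categories and only receives them.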
The split into training and test sets is done in Statistica. The dataset was imbalanced, so I applied oversampling to the training set.
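For reference, oversampling the training set can be sketched in pandas as below. This is plain random oversampling up to a full class balance, which is an assumption: the text does not state which variant or target ratio was used in Statistica, and `random_oversample` is a hypothetical helper.

```python
import pandas as pd

def random_oversample(train, target="TARGET", random_state=0):
    """Duplicate randomly chosen minority-class rows until all classes are equally frequent.

    Hypothetical helper; meant to be applied to the training set only.
    """
    counts = train[target].value_counts()
    majority_size = counts.max()
    parts = []
    for cls, n in counts.items():
        part = train[train[target] == cls]
        if n < majority_size:
            # Draw the missing rows with replacement from the same class
            extra = part.sample(n=majority_size - n, replace=True, random_state=random_state)
            part = pd.concat([part, extra])
        parts.append(part)
    # Shuffle so that the classes are not stored in blocks
    return pd.concat(parts).sample(frac=1, random_state=random_state).reset_index(drop=True)

# Toy training set: 8 rows of class 0 and 2 rows of class 1
toy_train = pd.DataFrame({"x": range(10), "TARGET": [0] * 8 + [1] * 2})
balanced = random_oversample(toy_train)
print(balanced["TARGET"].value_counts())
```

The test set is left untouched, so that the evaluation reflects the original class proportions.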
# A handy way to detect a non-standard encoding of a CSV file
# pip install chardet
import chardet
# Check the file encoding
with open('df_proby.csv', 'rb') as file:
    result = chardet.detect(file.read(10000))  # Read only part of the file to speed things up
encoding = result['encoding']
print("Detected encoding:", encoding)
Detected encoding: ISO-8859-1
df_all = pd.read_csv('df_proby.csv', encoding='ISO-8859-1', sep=';', decimal=',')
df_all.shape
(513897, 252)
df_all.dtypes
SK_ID_CURR int64 TARGET int64 NAME_CONTRACT_TYPE object CODE_GENDER object FLAG_OWN_CAR object FLAG_OWN_REALTY object CNT_CHILDREN int64 AMT_INCOME_TOTAL float64 AMT_CREDIT float64 AMT_ANNUITY float64 AMT_GOODS_PRICE float64 NAME_TYPE_SUITE object NAME_INCOME_TYPE object NAME_EDUCATION_TYPE object NAME_FAMILY_STATUS object NAME_HOUSING_TYPE object REGION_POPULATION_RELATIVE float64 DAYS_BIRTH int64 DAYS_EMPLOYED int64 DAYS_REGISTRATION float64 DAYS_ID_PUBLISH int64 FLAG_MOBIL int64 FLAG_EMP_PHONE int64 FLAG_WORK_PHONE int64 FLAG_CONT_MOBILE int64 FLAG_PHONE int64 FLAG_EMAIL int64 OCCUPATION_TYPE object CNT_FAM_MEMBERS int64 REGION_RATING_CLIENT int64 REGION_RATING_CLIENT_W_CITY int64 WEEKDAY_APPR_PROCESS_START object HOUR_APPR_PROCESS_START int64 REG_REGION_NOT_LIVE_REGION int64 REG_REGION_NOT_WORK_REGION int64 LIVE_REGION_NOT_WORK_REGION int64 REG_CITY_NOT_LIVE_CITY int64 REG_CITY_NOT_WORK_CITY int64 LIVE_CITY_NOT_WORK_CITY int64 ORGANIZATION_TYPE object EXT_SOURCE_2 float64 EXT_SOURCE_3 float64 OBS_30_CNT_SOCIAL_CIRCLE int64 DEF_30_CNT_SOCIAL_CIRCLE int64 OBS_60_CNT_SOCIAL_CIRCLE int64 DEF_60_CNT_SOCIAL_CIRCLE int64 DAYS_LAST_PHONE_CHANGE int64 FLAG_DOCUMENT_2 int64 FLAG_DOCUMENT_3 int64 FLAG_DOCUMENT_4 int64 FLAG_DOCUMENT_5 int64 FLAG_DOCUMENT_6 int64 FLAG_DOCUMENT_7 int64 FLAG_DOCUMENT_8 int64 FLAG_DOCUMENT_9 int64 FLAG_DOCUMENT_10 int64 FLAG_DOCUMENT_11 int64 FLAG_DOCUMENT_12 int64 FLAG_DOCUMENT_13 int64 FLAG_DOCUMENT_14 int64 FLAG_DOCUMENT_15 int64 FLAG_DOCUMENT_16 int64 FLAG_DOCUMENT_17 int64 FLAG_DOCUMENT_18 int64 FLAG_DOCUMENT_19 int64 FLAG_DOCUMENT_20 int64 FLAG_DOCUMENT_21 int64 AMT_REQ_CREDIT_BUREAU_YEAR float64 BUREAU_DAYS_CREDIT_MEAN float64 BUREAU_DAYS_CREDIT_MAX float64 BUREAU_DAYS_CREDIT_MIN float64 BUREAU_CREDIT_DAY_OVERDUE_MEAN float64 BUREAU_CREDIT_DAY_OVERDUE_SUM float64 BUREAU_CREDIT_DAY_OVERDUE_MAX float64 BUREAU_DAYS_CREDIT_ENDDATE_MEAN float64 BUREAU_DAYS_CREDIT_ENDDATE_MAX float64 BUREAU_DAYS_CREDIT_ENDDATE_MIN float64 
BUREAU_CNT_CREDIT_PROLONG_MEAN float64 BUREAU_CNT_CREDIT_PROLONG_SUM float64 BUREAU_AMT_CREDIT_SUM_MEAN float64 BUREAU_AMT_CREDIT_SUM_SUM float64 BUREAU_AMT_CREDIT_SUM_DEBT_MEAN float64 BUREAU_AMT_CREDIT_SUM_DEBT_SUM float64 BUREAU_AMT_CREDIT_SUM_LIMIT_SUM float64 BUREAU_AMT_CREDIT_SUM_OVERDUE_MEAN float64 BUREAU_AMT_CREDIT_SUM_OVERDUE_SUM float64 BUREAU_DAYS_CREDIT_UPDATE_MAX float64 BUREAU_DAYS_CREDIT_UPDATE_MIN float64 BUREAU_MOST_FREQ_CREDIT_ACTIVE object BUREAU_MOST_FREQ_CREDIT_CURRENCY object BUREAU_MOST_FREQ_CREDIT_TYPE object BUREAU_BALANCE_STATUS_0 float64 BUREAU_BALANCE_STATUS_1 float64 BUREAU_BALANCE_STATUS_2 float64 BUREAU_BALANCE_STATUS_3 float64 BUREAU_BALANCE_STATUS_4 float64 BUREAU_BALANCE_STATUS_5 float64 BUREAU_BALANCE_STATUS_C float64 BUREAU_BALANCE_STATUS_X float64 PREVIOUS_APPLICATION_AMT_ANNUITY_MEAN float64 PREVIOUS_APPLICATION_AMT_ANNUITY_MAX float64 PREVIOUS_APPLICATION_AMT_ANNUITY_MIN float64 PREVIOUS_APPLICATION_AMT_ANNUITY_SUM float64 PREVIOUS_APPLICATION_AMT_APPLICATION_MEAN float64 PREVIOUS_APPLICATION_AMT_APPLICATION_MAX float64 PREVIOUS_APPLICATION_AMT_APPLICATION_MIN float64 PREVIOUS_APPLICATION_AMT_APPLICATION_SUM float64 PREVIOUS_APPLICATION_AMT_CREDIT_MEAN float64 PREVIOUS_APPLICATION_AMT_CREDIT_MAX float64 PREVIOUS_APPLICATION_AMT_CREDIT_MIN float64 PREVIOUS_APPLICATION_AMT_CREDIT_SUM float64 PREVIOUS_APPLICATION_AMT_GOODS_PRICE_MEAN float64 PREVIOUS_APPLICATION_AMT_GOODS_PRICE_MAX float64 PREVIOUS_APPLICATION_AMT_GOODS_PRICE_MIN float64 PREVIOUS_APPLICATION_AMT_GOODS_PRICE_SUM float64 PREVIOUS_APPLICATION_HOUR_APPR_PROCESS_START_MEAN float64 PREVIOUS_APPLICATION_HOUR_APPR_PROCESS_START_MIN float64 PREVIOUS_APPLICATION_HOUR_APPR_PROCESS_START_MAX float64 PREVIOUS_APPLICATION_DAYS_DECISION_MEAN float64 PREVIOUS_APPLICATION_DAYS_DECISION_MAX float64 PREVIOUS_APPLICATION_DAYS_DECISION_MIN float64 PREVIOUS_APPLICATION_DAYS_DECISION_SUM float64 PREVIOUS_APPLICATION_RATIO_APP_TO_GET_MEAN float64 
PREVIOUS_APPLICATION_RATIO_APP_TO_GET_MAX float64 PREVIOUS_APPLICATION_RATIO_APP_TO_GET_MIN float64 PREVIOUS_APPLICATION_PREV_APPS_COUNT float64 PREVIOUS_APPLICATION_NAME_CONTRACT_TYPE_CASH LOANS float64 PREVIOUS_APPLICATION_NAME_CONTRACT_TYPE_CONSUMER LOANS float64 PREVIOUS_APPLICATION_NAME_CONTRACT_TYPE_REVOLVING LOANS float64 PREVIOUS_APPLICATION_NAME_CONTRACT_TYPE_XNA float64 PREVIOUS_APPLICATION_WEEKDAY_APPR_PROCESS_START_FRIDAY float64 PREVIOUS_APPLICATION_WEEKDAY_APPR_PROCESS_START_MONDAY float64 PREVIOUS_APPLICATION_WEEKDAY_APPR_PROCESS_START_SATURDAY float64 PREVIOUS_APPLICATION_WEEKDAY_APPR_PROCESS_START_SUNDAY float64 PREVIOUS_APPLICATION_WEEKDAY_APPR_PROCESS_START_THURSDAY float64 PREVIOUS_APPLICATION_WEEKDAY_APPR_PROCESS_START_TUESDAY float64 PREVIOUS_APPLICATION_WEEKDAY_APPR_PROCESS_START_WEDNESDAY float64 PREVIOUS_APPLICATION_NAME_CONTRACT_STATUS_APPROVED float64 PREVIOUS_APPLICATION_NAME_CONTRACT_STATUS_CANCELED float64 PREVIOUS_APPLICATION_NAME_CONTRACT_STATUS_REFUSED float64 PREVIOUS_APPLICATION_NAME_CONTRACT_STATUS_UNUSED OFFER float64 PREVIOUS_APPLICATION_NAME_CLIENT_TYPE_NEW float64 PREVIOUS_APPLICATION_NAME_CLIENT_TYPE_REFRESHED float64 PREVIOUS_APPLICATION_NAME_CLIENT_TYPE_REPEATER float64 PREVIOUS_APPLICATION_NAME_CLIENT_TYPE_XNA float64 PREVIOUS_APPLICATION_CODE_REJECT_REASON_CLIENT float64 PREVIOUS_APPLICATION_CODE_REJECT_REASON_HC float64 PREVIOUS_APPLICATION_CODE_REJECT_REASON_LIMIT float64 PREVIOUS_APPLICATION_CODE_REJECT_REASON_SCO float64 PREVIOUS_APPLICATION_CODE_REJECT_REASON_SCOFR float64 PREVIOUS_APPLICATION_CODE_REJECT_REASON_SYSTEM float64 PREVIOUS_APPLICATION_CODE_REJECT_REASON_VERIF float64 PREVIOUS_APPLICATION_CODE_REJECT_REASON_XAP float64 PREVIOUS_APPLICATION_CODE_REJECT_REASON_XNA float64 PREVIOUS_APPLICATION_NAME_PRODUCT_TYPE_XNA float64 PREVIOUS_APPLICATION_NAME_PRODUCT_TYPE_WALK-IN float64 PREVIOUS_APPLICATION_NAME_PRODUCT_TYPE_X-SELL float64 PREVIOUS_APPLICATION_NAME_YIELD_GROUP_XNA float64 
PREVIOUS_APPLICATION_NAME_YIELD_GROUP_HIGH float64 PREVIOUS_APPLICATION_NAME_YIELD_GROUP_LOW_ACTION float64 PREVIOUS_APPLICATION_NAME_YIELD_GROUP_LOW_NORMAL float64 PREVIOUS_APPLICATION_NAME_YIELD_GROUP_MIDDLE float64 PREVIOUS_APPLICATION_MOST_FREQ_PREVIOUS_APPLICATION_NAME_CASH_LOAN_PURPOSE object PREVIOUS_APPLICATION_MOST_FREQ_PREVIOUS_APPLICATION_NAME_PAYMENT_TYPE object PREVIOUS_APPLICATION_MOST_FREQ_PREVIOUS_APPLICATION_NAME_GOODS_CATEGORY object PREVIOUS_APPLICATION_MOST_FREQ_PREVIOUS_APPLICATION_CHANNEL_TYPE object PREVIOUS_APPLICATION_MOST_FREQ_PREVIOUS_APPLICATION_PRODUCT_COMBINATION object POS_CASH_MONTHS_BALANCE_MIN float64 POS_CASH_MONTHS_BALANCE_MAX float64 POS_CASH_MONTHS_BALANCE_MEAN float64 POS_CASH_CNT_INSTALMENT_MEAN float64 POS_CASH_CNT_INSTALMENT_SUM float64 POS_CASH_CNT_INSTALMENT_MAX float64 POS_CASH_CNT_INSTALMENT_MIN float64 POS_CASH_CNT_INSTALMENT_FUTURE_MEAN float64 POS_CASH_CNT_INSTALMENT_FUTURE_SUM float64 POS_CASH_CNT_INSTALMENT_FUTURE_MAX float64 POS_CASH_CNT_INSTALMENT_FUTURE_MIN float64 POS_CASH_SK_DPD_MAX float64 POS_CASH_SK_DPD_MEAN float64 POS_CASH_SK_DPD_DEF_MAX float64 POS_CASH_SK_DPD_DEF_MEAN float64 POS_CASH_NAME_CONTRACT_STATUS_ACTIVE float64 POS_CASH_NAME_CONTRACT_STATUS_AMORTIZED DEBT float64 POS_CASH_NAME_CONTRACT_STATUS_APPROVED float64 POS_CASH_NAME_CONTRACT_STATUS_CANCELED float64 POS_CASH_NAME_CONTRACT_STATUS_COMPLETED float64 POS_CASH_NAME_CONTRACT_STATUS_DEMAND float64 POS_CASH_NAME_CONTRACT_STATUS_RETURNED TO THE STORE float64 POS_CASH_NAME_CONTRACT_STATUS_SIGNED float64 POS_CASH_NAME_CONTRACT_STATUS_XNA float64 POS_CASH_APP_COUNT float64 INSTALLMENTS_PAYMENTS_IS_DELAYED_SUM float64 INSTALLMENTS_PAYMENTS_IS_DELAYED_MEAN float64 INSTALLMENTS_PAYMENTS_NUM_INSTALMENT_VERSION_NUNIQUE float64 INSTALLMENTS_PAYMENTS_PAYMENT_CHANGE_MEAN float64 INSTALLMENTS_PAYMENTS_PAYMENT_CHANGE_MAX float64 INSTALLMENTS_PAYMENTS_PAYMENT_UNDERPAYMENT_MEAN float64 INSTALLMENTS_PAYMENTS_PAYMENT_UNDERPAYMENT_SUM float64 
INSTALLMENTS_PAYMENTS_PAYMENT_UNDERPAYMENT_MAX float64 INSTALLMENTS_PAYMENTS_PERCENTAGE_DELAYED float64 CREDIT_CARD_BALANCE_MONTHS_BALANCE_MIN float64 CREDIT_CARD_BALANCE_MONTHS_BALANCE_MAX float64 CREDIT_CARD_BALANCE_MONTHS_BALANCE_MEAN float64 CREDIT_CARD_BALANCE_AMT_BALANCE_SUM float64 CREDIT_CARD_BALANCE_AMT_BALANCE_MEAN float64 CREDIT_CARD_BALANCE_AMT_BALANCE_MAX float64 CREDIT_CARD_BALANCE_AMT_BALANCE_MIN float64 CREDIT_CARD_BALANCE_AMT_BALANCE_STD float64 CREDIT_CARD_BALANCE_AMT_CREDIT_LIMIT_ACTUAL_SUM float64 CREDIT_CARD_BALANCE_AMT_CREDIT_LIMIT_ACTUAL_MEAN float64 CREDIT_CARD_BALANCE_AMT_CREDIT_LIMIT_ACTUAL_MAX float64 CREDIT_CARD_BALANCE_AMT_CREDIT_LIMIT_ACTUAL_MIN float64 CREDIT_CARD_BALANCE_AMT_CREDIT_LIMIT_ACTUAL_STD float64 CREDIT_CARD_BALANCE_AMT_DRAWINGS_CURRENT_SUM float64 CREDIT_CARD_BALANCE_AMT_DRAWINGS_CURRENT_MEAN float64 CREDIT_CARD_BALANCE_AMT_DRAWINGS_CURRENT_MAX float64 CREDIT_CARD_BALANCE_AMT_DRAWINGS_CURRENT_MIN float64 CREDIT_CARD_BALANCE_AMT_DRAWINGS_CURRENT_STD float64 CREDIT_CARD_BALANCE_AMT_PAYMENT_TOTAL_CURRENT_SUM float64 CREDIT_CARD_BALANCE_AMT_PAYMENT_TOTAL_CURRENT_MEAN float64 CREDIT_CARD_BALANCE_AMT_PAYMENT_TOTAL_CURRENT_MAX float64 CREDIT_CARD_BALANCE_AMT_PAYMENT_TOTAL_CURRENT_MIN float64 CREDIT_CARD_BALANCE_AMT_TOTAL_RECEIVABLE_SUM float64 CREDIT_CARD_BALANCE_AMT_TOTAL_RECEIVABLE_MEAN float64 CREDIT_CARD_BALANCE_AMT_TOTAL_RECEIVABLE_MAX float64 CREDIT_CARD_BALANCE_AMT_TOTAL_RECEIVABLE_MIN float64 CREDIT_CARD_BALANCE_AMT_TOTAL_RECEIVABLE_STD float64 CREDIT_CARD_BALANCE_CNT_DRAWINGS_CURRENT_SUM float64 CREDIT_CARD_BALANCE_CNT_DRAWINGS_CURRENT_MEAN float64 CREDIT_CARD_BALANCE_CNT_DRAWINGS_CURRENT_MAX float64 CREDIT_CARD_BALANCE_CNT_DRAWINGS_CURRENT_MIN float64 CREDIT_CARD_BALANCE_CNT_INSTALMENT_MATURE_CUM_SUM float64 CREDIT_CARD_BALANCE_CNT_INSTALMENT_MATURE_CUM_MEAN float64 CREDIT_CARD_BALANCE_CNT_INSTALMENT_MATURE_CUM_MAX float64 CREDIT_CARD_BALANCE_CNT_INSTALMENT_MATURE_CUM_MIN float64 CREDIT_CARD_BALANCE_SK_DPD_MAX float64 
CREDIT_CARD_BALANCE_SK_DPD_MEAN float64 CREDIT_CARD_BALANCE_SK_DPD_SUM float64 CREDIT_CARD_BALANCE_SK_DPD_DEF_MAX float64 CREDIT_CARD_BALANCE_SK_DPD_DEF_MEAN float64 CREDIT_CARD_BALANCE_SK_DPD_DEF_SUM float64 CREDIT_CARD_BALANCE_CREDIT_USE_RATE_MEAN float64 CREDIT_CARD_BALANCE_CREDIT_USE_RATE_MAX float64 CREDIT_CARD_BALANCE_STATUS_Active float64 CREDIT_CARD_BALANCE_STATUS_Approved float64 CREDIT_CARD_BALANCE_STATUS_Completed float64 CREDIT_CARD_BALANCE_STATUS_Demand float64 CREDIT_CARD_BALANCE_STATUS_Refused float64 CREDIT_CARD_BALANCE_STATUS_Sent proposal float64 CREDIT_CARD_BALANCE_STATUS_Signed float64 Próba object dtype: object
df_all.head(5)
| SK_ID_CURR | TARGET | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | AMT_GOODS_PRICE | NAME_TYPE_SUITE | NAME_INCOME_TYPE | NAME_EDUCATION_TYPE | NAME_FAMILY_STATUS | NAME_HOUSING_TYPE | REGION_POPULATION_RELATIVE | DAYS_BIRTH | DAYS_EMPLOYED | DAYS_REGISTRATION | DAYS_ID_PUBLISH | FLAG_MOBIL | FLAG_EMP_PHONE | FLAG_WORK_PHONE | FLAG_CONT_MOBILE | FLAG_PHONE | FLAG_EMAIL | OCCUPATION_TYPE | CNT_FAM_MEMBERS | REGION_RATING_CLIENT | REGION_RATING_CLIENT_W_CITY | WEEKDAY_APPR_PROCESS_START | HOUR_APPR_PROCESS_START | REG_REGION_NOT_LIVE_REGION | REG_REGION_NOT_WORK_REGION | LIVE_REGION_NOT_WORK_REGION | REG_CITY_NOT_LIVE_CITY | REG_CITY_NOT_WORK_CITY | LIVE_CITY_NOT_WORK_CITY | ORGANIZATION_TYPE | EXT_SOURCE_2 | EXT_SOURCE_3 | OBS_30_CNT_SOCIAL_CIRCLE | DEF_30_CNT_SOCIAL_CIRCLE | OBS_60_CNT_SOCIAL_CIRCLE | DEF_60_CNT_SOCIAL_CIRCLE | DAYS_LAST_PHONE_CHANGE | FLAG_DOCUMENT_2 | FLAG_DOCUMENT_3 | FLAG_DOCUMENT_4 | FLAG_DOCUMENT_5 | FLAG_DOCUMENT_6 | FLAG_DOCUMENT_7 | FLAG_DOCUMENT_8 | FLAG_DOCUMENT_9 | FLAG_DOCUMENT_10 | FLAG_DOCUMENT_11 | FLAG_DOCUMENT_12 | FLAG_DOCUMENT_13 | FLAG_DOCUMENT_14 | FLAG_DOCUMENT_15 | FLAG_DOCUMENT_16 | FLAG_DOCUMENT_17 | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_YEAR | BUREAU_DAYS_CREDIT_MEAN | BUREAU_DAYS_CREDIT_MAX | BUREAU_DAYS_CREDIT_MIN | BUREAU_CREDIT_DAY_OVERDUE_MEAN | BUREAU_CREDIT_DAY_OVERDUE_SUM | BUREAU_CREDIT_DAY_OVERDUE_MAX | BUREAU_DAYS_CREDIT_ENDDATE_MEAN | BUREAU_DAYS_CREDIT_ENDDATE_MAX | BUREAU_DAYS_CREDIT_ENDDATE_MIN | BUREAU_CNT_CREDIT_PROLONG_MEAN | BUREAU_CNT_CREDIT_PROLONG_SUM | BUREAU_AMT_CREDIT_SUM_MEAN | BUREAU_AMT_CREDIT_SUM_SUM | BUREAU_AMT_CREDIT_SUM_DEBT_MEAN | BUREAU_AMT_CREDIT_SUM_DEBT_SUM | BUREAU_AMT_CREDIT_SUM_LIMIT_SUM | BUREAU_AMT_CREDIT_SUM_OVERDUE_MEAN | BUREAU_AMT_CREDIT_SUM_OVERDUE_SUM | BUREAU_DAYS_CREDIT_UPDATE_MAX | BUREAU_DAYS_CREDIT_UPDATE_MIN | 
BUREAU_MOST_FREQ_CREDIT_ACTIVE | BUREAU_MOST_FREQ_CREDIT_CURRENCY | BUREAU_MOST_FREQ_CREDIT_TYPE | BUREAU_BALANCE_STATUS_0 | BUREAU_BALANCE_STATUS_1 | BUREAU_BALANCE_STATUS_2 | BUREAU_BALANCE_STATUS_3 | BUREAU_BALANCE_STATUS_4 | BUREAU_BALANCE_STATUS_5 | BUREAU_BALANCE_STATUS_C | BUREAU_BALANCE_STATUS_X | PREVIOUS_APPLICATION_AMT_ANNUITY_MEAN | PREVIOUS_APPLICATION_AMT_ANNUITY_MAX | PREVIOUS_APPLICATION_AMT_ANNUITY_MIN | PREVIOUS_APPLICATION_AMT_ANNUITY_SUM | PREVIOUS_APPLICATION_AMT_APPLICATION_MEAN | PREVIOUS_APPLICATION_AMT_APPLICATION_MAX | PREVIOUS_APPLICATION_AMT_APPLICATION_MIN | PREVIOUS_APPLICATION_AMT_APPLICATION_SUM | PREVIOUS_APPLICATION_AMT_CREDIT_MEAN | PREVIOUS_APPLICATION_AMT_CREDIT_MAX | PREVIOUS_APPLICATION_AMT_CREDIT_MIN | PREVIOUS_APPLICATION_AMT_CREDIT_SUM | PREVIOUS_APPLICATION_AMT_GOODS_PRICE_MEAN | PREVIOUS_APPLICATION_AMT_GOODS_PRICE_MAX | PREVIOUS_APPLICATION_AMT_GOODS_PRICE_MIN | PREVIOUS_APPLICATION_AMT_GOODS_PRICE_SUM | PREVIOUS_APPLICATION_HOUR_APPR_PROCESS_START_MEAN | PREVIOUS_APPLICATION_HOUR_APPR_PROCESS_START_MIN | PREVIOUS_APPLICATION_HOUR_APPR_PROCESS_START_MAX | PREVIOUS_APPLICATION_DAYS_DECISION_MEAN | PREVIOUS_APPLICATION_DAYS_DECISION_MAX | PREVIOUS_APPLICATION_DAYS_DECISION_MIN | PREVIOUS_APPLICATION_DAYS_DECISION_SUM | PREVIOUS_APPLICATION_RATIO_APP_TO_GET_MEAN | PREVIOUS_APPLICATION_RATIO_APP_TO_GET_MAX | PREVIOUS_APPLICATION_RATIO_APP_TO_GET_MIN | PREVIOUS_APPLICATION_PREV_APPS_COUNT | PREVIOUS_APPLICATION_NAME_CONTRACT_TYPE_CASH LOANS | PREVIOUS_APPLICATION_NAME_CONTRACT_TYPE_CONSUMER LOANS | PREVIOUS_APPLICATION_NAME_CONTRACT_TYPE_REVOLVING LOANS | PREVIOUS_APPLICATION_NAME_CONTRACT_TYPE_XNA | PREVIOUS_APPLICATION_WEEKDAY_APPR_PROCESS_START_FRIDAY | PREVIOUS_APPLICATION_WEEKDAY_APPR_PROCESS_START_MONDAY | PREVIOUS_APPLICATION_WEEKDAY_APPR_PROCESS_START_SATURDAY | PREVIOUS_APPLICATION_WEEKDAY_APPR_PROCESS_START_SUNDAY | PREVIOUS_APPLICATION_WEEKDAY_APPR_PROCESS_START_THURSDAY | 
PREVIOUS_APPLICATION_WEEKDAY_APPR_PROCESS_START_TUESDAY | PREVIOUS_APPLICATION_WEEKDAY_APPR_PROCESS_START_WEDNESDAY | PREVIOUS_APPLICATION_NAME_CONTRACT_STATUS_APPROVED | PREVIOUS_APPLICATION_NAME_CONTRACT_STATUS_CANCELED | PREVIOUS_APPLICATION_NAME_CONTRACT_STATUS_REFUSED | PREVIOUS_APPLICATION_NAME_CONTRACT_STATUS_UNUSED OFFER | PREVIOUS_APPLICATION_NAME_CLIENT_TYPE_NEW | PREVIOUS_APPLICATION_NAME_CLIENT_TYPE_REFRESHED | PREVIOUS_APPLICATION_NAME_CLIENT_TYPE_REPEATER | PREVIOUS_APPLICATION_NAME_CLIENT_TYPE_XNA | PREVIOUS_APPLICATION_CODE_REJECT_REASON_CLIENT | PREVIOUS_APPLICATION_CODE_REJECT_REASON_HC | PREVIOUS_APPLICATION_CODE_REJECT_REASON_LIMIT | PREVIOUS_APPLICATION_CODE_REJECT_REASON_SCO | PREVIOUS_APPLICATION_CODE_REJECT_REASON_SCOFR | PREVIOUS_APPLICATION_CODE_REJECT_REASON_SYSTEM | PREVIOUS_APPLICATION_CODE_REJECT_REASON_VERIF | PREVIOUS_APPLICATION_CODE_REJECT_REASON_XAP | PREVIOUS_APPLICATION_CODE_REJECT_REASON_XNA | PREVIOUS_APPLICATION_NAME_PRODUCT_TYPE_XNA | PREVIOUS_APPLICATION_NAME_PRODUCT_TYPE_WALK-IN | PREVIOUS_APPLICATION_NAME_PRODUCT_TYPE_X-SELL | PREVIOUS_APPLICATION_NAME_YIELD_GROUP_XNA | PREVIOUS_APPLICATION_NAME_YIELD_GROUP_HIGH | PREVIOUS_APPLICATION_NAME_YIELD_GROUP_LOW_ACTION | PREVIOUS_APPLICATION_NAME_YIELD_GROUP_LOW_NORMAL | PREVIOUS_APPLICATION_NAME_YIELD_GROUP_MIDDLE | PREVIOUS_APPLICATION_MOST_FREQ_PREVIOUS_APPLICATION_NAME_CASH_LOAN_PURPOSE | PREVIOUS_APPLICATION_MOST_FREQ_PREVIOUS_APPLICATION_NAME_PAYMENT_TYPE | PREVIOUS_APPLICATION_MOST_FREQ_PREVIOUS_APPLICATION_NAME_GOODS_CATEGORY | PREVIOUS_APPLICATION_MOST_FREQ_PREVIOUS_APPLICATION_CHANNEL_TYPE | PREVIOUS_APPLICATION_MOST_FREQ_PREVIOUS_APPLICATION_PRODUCT_COMBINATION | POS_CASH_MONTHS_BALANCE_MIN | POS_CASH_MONTHS_BALANCE_MAX | POS_CASH_MONTHS_BALANCE_MEAN | POS_CASH_CNT_INSTALMENT_MEAN | POS_CASH_CNT_INSTALMENT_SUM | POS_CASH_CNT_INSTALMENT_MAX | POS_CASH_CNT_INSTALMENT_MIN | POS_CASH_CNT_INSTALMENT_FUTURE_MEAN | POS_CASH_CNT_INSTALMENT_FUTURE_SUM | 
POS_CASH_CNT_INSTALMENT_FUTURE_MAX | POS_CASH_CNT_INSTALMENT_FUTURE_MIN | POS_CASH_SK_DPD_MAX | POS_CASH_SK_DPD_MEAN | POS_CASH_SK_DPD_DEF_MAX | POS_CASH_SK_DPD_DEF_MEAN | POS_CASH_NAME_CONTRACT_STATUS_ACTIVE | POS_CASH_NAME_CONTRACT_STATUS_AMORTIZED DEBT | POS_CASH_NAME_CONTRACT_STATUS_APPROVED | POS_CASH_NAME_CONTRACT_STATUS_CANCELED | POS_CASH_NAME_CONTRACT_STATUS_COMPLETED | POS_CASH_NAME_CONTRACT_STATUS_DEMAND | POS_CASH_NAME_CONTRACT_STATUS_RETURNED TO THE STORE | POS_CASH_NAME_CONTRACT_STATUS_SIGNED | POS_CASH_NAME_CONTRACT_STATUS_XNA | POS_CASH_APP_COUNT | INSTALLMENTS_PAYMENTS_IS_DELAYED_SUM | INSTALLMENTS_PAYMENTS_IS_DELAYED_MEAN | INSTALLMENTS_PAYMENTS_NUM_INSTALMENT_VERSION_NUNIQUE | INSTALLMENTS_PAYMENTS_PAYMENT_CHANGE_MEAN | INSTALLMENTS_PAYMENTS_PAYMENT_CHANGE_MAX | INSTALLMENTS_PAYMENTS_PAYMENT_UNDERPAYMENT_MEAN | INSTALLMENTS_PAYMENTS_PAYMENT_UNDERPAYMENT_SUM | INSTALLMENTS_PAYMENTS_PAYMENT_UNDERPAYMENT_MAX | INSTALLMENTS_PAYMENTS_PERCENTAGE_DELAYED | CREDIT_CARD_BALANCE_MONTHS_BALANCE_MIN | CREDIT_CARD_BALANCE_MONTHS_BALANCE_MAX | CREDIT_CARD_BALANCE_MONTHS_BALANCE_MEAN | CREDIT_CARD_BALANCE_AMT_BALANCE_SUM | CREDIT_CARD_BALANCE_AMT_BALANCE_MEAN | CREDIT_CARD_BALANCE_AMT_BALANCE_MAX | CREDIT_CARD_BALANCE_AMT_BALANCE_MIN | CREDIT_CARD_BALANCE_AMT_BALANCE_STD | CREDIT_CARD_BALANCE_AMT_CREDIT_LIMIT_ACTUAL_SUM | CREDIT_CARD_BALANCE_AMT_CREDIT_LIMIT_ACTUAL_MEAN | CREDIT_CARD_BALANCE_AMT_CREDIT_LIMIT_ACTUAL_MAX | CREDIT_CARD_BALANCE_AMT_CREDIT_LIMIT_ACTUAL_MIN | CREDIT_CARD_BALANCE_AMT_CREDIT_LIMIT_ACTUAL_STD | CREDIT_CARD_BALANCE_AMT_DRAWINGS_CURRENT_SUM | CREDIT_CARD_BALANCE_AMT_DRAWINGS_CURRENT_MEAN | CREDIT_CARD_BALANCE_AMT_DRAWINGS_CURRENT_MAX | CREDIT_CARD_BALANCE_AMT_DRAWINGS_CURRENT_MIN | CREDIT_CARD_BALANCE_AMT_DRAWINGS_CURRENT_STD | CREDIT_CARD_BALANCE_AMT_PAYMENT_TOTAL_CURRENT_SUM | CREDIT_CARD_BALANCE_AMT_PAYMENT_TOTAL_CURRENT_MEAN | CREDIT_CARD_BALANCE_AMT_PAYMENT_TOTAL_CURRENT_MAX | CREDIT_CARD_BALANCE_AMT_PAYMENT_TOTAL_CURRENT_MIN | 
CREDIT_CARD_BALANCE_AMT_TOTAL_RECEIVABLE_SUM | CREDIT_CARD_BALANCE_AMT_TOTAL_RECEIVABLE_MEAN | CREDIT_CARD_BALANCE_AMT_TOTAL_RECEIVABLE_MAX | CREDIT_CARD_BALANCE_AMT_TOTAL_RECEIVABLE_MIN | CREDIT_CARD_BALANCE_AMT_TOTAL_RECEIVABLE_STD | CREDIT_CARD_BALANCE_CNT_DRAWINGS_CURRENT_SUM | CREDIT_CARD_BALANCE_CNT_DRAWINGS_CURRENT_MEAN | CREDIT_CARD_BALANCE_CNT_DRAWINGS_CURRENT_MAX | CREDIT_CARD_BALANCE_CNT_DRAWINGS_CURRENT_MIN | CREDIT_CARD_BALANCE_CNT_INSTALMENT_MATURE_CUM_SUM | CREDIT_CARD_BALANCE_CNT_INSTALMENT_MATURE_CUM_MEAN | CREDIT_CARD_BALANCE_CNT_INSTALMENT_MATURE_CUM_MAX | CREDIT_CARD_BALANCE_CNT_INSTALMENT_MATURE_CUM_MIN | CREDIT_CARD_BALANCE_SK_DPD_MAX | CREDIT_CARD_BALANCE_SK_DPD_MEAN | CREDIT_CARD_BALANCE_SK_DPD_SUM | CREDIT_CARD_BALANCE_SK_DPD_DEF_MAX | CREDIT_CARD_BALANCE_SK_DPD_DEF_MEAN | CREDIT_CARD_BALANCE_SK_DPD_DEF_SUM | CREDIT_CARD_BALANCE_CREDIT_USE_RATE_MEAN | CREDIT_CARD_BALANCE_CREDIT_USE_RATE_MAX | CREDIT_CARD_BALANCE_STATUS_Active | CREDIT_CARD_BALANCE_STATUS_Approved | CREDIT_CARD_BALANCE_STATUS_Completed | CREDIT_CARD_BALANCE_STATUS_Demand | CREDIT_CARD_BALANCE_STATUS_Refused | CREDIT_CARD_BALANCE_STATUS_Sent proposal | CREDIT_CARD_BALANCE_STATUS_Signed | Próba | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 113281 | 0 | Cash loans | F | N | Y | 1 | 76500.00 | 112500.00 | 11254.50 | 112500.00 | Family | Working | Secondary / secondary special | Married | House / apartment | 0.02 | -14669 | -4882 | -6152.00 | -4731 | 1 | 1 | 1 | 1 | 0 | 0 | Security staff | 3 | 2 | 2 | WEDNESDAY | 11 | 0 | 0 | 0 | 0 | 0 | 0 | Business Entity Type 3 | 0.18 | 0.68 | 5 | 1 | 5 | 0 | -228 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1.00 | -1317.00 | -151.00 | -2904.00 | 0.00 | 0.00 | 0.00 | 325.33 | 1449.00 | -1078.00 | 0.00 | 0.00 | 395797.50 | 2374785.00 | 113368.50 | 566842.50 | 0.00 | 0.00 | 0.00 | -25.00 | -518.00 | Active | currency 1 | Consumer credit | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 7868.64 | 10207.17 | 5099.58 | 23605.92 | 67940.04 | 77161.50 | 62803.62 | 203820.12 | 65641.50 | 84465.00 | 50247.00 | 196924.50 | 67940.04 | 77161.50 | 62803.62 | 203820.12 | 11.67 | 10.00 | 13.00 | -1232.67 | -228.00 | -2657.00 | -3698.00 | 1.06 | 1.25 | 0.91 | 15.00 | 0.00 | 3.00 | 0.00 | 0.00 | 0.00 | 0.00 | 2.00 | 0.00 | 0.00 | 1.00 | 0.00 | 3.00 | 0.00 | 0.00 | 0.00 | 1.00 | 1.00 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 3.00 | 0.00 | 3.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | 2.00 | XAP | Cash through the bank | Mobile | Country-wide | POS mobile with interest | -87.00 | -3.00 | -38.59 | 9.88 | 316.00 | 12.00 | 4.00 | 4.81 | 154.00 | 12.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 28.00 | 0.00 | 0.00 | 0.00 | 3.00 | 0.00 | 0.00 | 1.00 | 0.00 | 3.00 | 1.00 | 0.04 | 2.00 | 2661.37 | 15895.28 | 0.00 | 0.00 | 0.00 | 10.00 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Uczenie |
| 1 | 249784 | 0 | Cash loans | M | N | Y | 4 | 58500.00 | 96696.00 | 6426.00 | 76500.00 | Unaccompanied | Working | Higher education | Married | House / apartment | 0.01 | -17875 | -432 | -8753.00 | -1427 | 1 | 1 | 0 | 1 | 1 | 0 | Laborers | 6 | 2 | 2 | THURSDAY | 18 | 0 | 0 | 0 | 0 | 0 | 0 | Kindergarten | 0.38 | 0.83 | 0 | 0 | 0 | 0 | -2028 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1.00 | -2914.00 | -2914.00 | -2914.00 | 0.00 | 0.00 | 0.00 | -2549.00 | -2549.00 | -2549.00 | 0.00 | 0.00 | 45000.00 | 45000.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | -2546.00 | -2546.00 | Closed | currency 1 | Consumer credit | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 8729.55 | 8729.55 | 8729.55 | 8729.55 | 144202.50 | 144202.50 | 144202.50 | 144202.50 | 168948.00 | 168948.00 | 168948.00 | 168948.00 | 144202.50 | 144202.50 | 144202.50 | 144202.50 | 20.00 | 20.00 | 20.00 | -708.00 | -708.00 | -708.00 | -708.00 | 0.85 | 0.85 | 0.85 | 4.00 | 0.00 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | 1.00 | 0.00 | 0.00 | 0.00 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | 0.00 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | 0.00 | XAP | Cash through the bank | Computers | Regional / Local | POS household with interest | -24.00 | -2.00 | -13.00 | 24.00 | 552.00 | 24.00 | 24.00 | 13.00 | 299.00 | 24.00 | 2.00 | 0.00 | 0.00 | 0.00 | 0.00 | 23.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | 0.00 | 0.00 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Uczenie |
| 2 | 405688 | 0 | Cash loans | F | Y | Y | 0 | 225000.00 | 675000.00 | 32602.50 | 675000.00 | Unaccompanied | Commercial associate | Higher education | Married | House / apartment | 0.02 | -10648 | -1392 | -1208.00 | -980 | 1 | 1 | 0 | 1 | 0 | 0 | Managers | 2 | 3 | 3 | THURSDAY | 13 | 0 | 0 | 0 | 0 | 0 | 0 | Transport: type 4 | 0.54 | 0.77 | 0 | 0 | 0 | 0 | -574 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1.00 | -1110.00 | -566.00 | -1676.00 | 0.00 | 0.00 | 0.00 | 816.75 | 4274.00 | -427.00 | 0.00 | 0.00 | 1472400.00 | 7362000.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | -149.00 | -485.00 | Closed | currency 1 | Consumer credit | 163.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 17.00 | 5.00 | 24996.96 | 24996.96 | 24996.96 | 24996.96 | 675000.00 | 675000.00 | 675000.00 | 675000.00 | 782622.00 | 782622.00 | 782622.00 | 782622.00 | 675000.00 | 675000.00 | 675000.00 | 675000.00 | 9.00 | 9.00 | 9.00 | -574.00 | -574.00 | -574.00 | -574.00 | 0.86 | 0.86 | 0.86 | 9.00 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | 0.00 | 0.00 | 1.00 | 0.00 | 0.00 | 0.00 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | 0.00 | 0.00 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | 0.00 | Repairs | XNA | XNA | Credit and cash offices | Cash Street: low | -19.00 | -13.00 | -16.00 | 35.71 | 250.00 | 48.00 | 5.00 | 32.86 | 230.00 | 48.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 6.00 | 0.00 | 0.00 | 0.00 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | 0.00 | 0.00 | 2.00 | 326466.85 | 326466.85 | 0.00 | 0.00 | 0.00 | 0.00 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Uczenie |
| 3 | 397763 | 0 | Revolving loans | F | N | Y | 0 | 135000.00 | 405000.00 | 20250.00 | 405000.00 | Unaccompanied | Commercial associate | Secondary / secondary special | Single / not married | House / apartment | 0.03 | -19244 | -1310 | -8562.00 | -1574 | 1 | 1 | 0 | 1 | 0 | 0 | Brak Danych | 1 | 2 | 2 | WEDNESDAY | 13 | 0 | 0 | 0 | 0 | 0 | 0 | Housing | 0.23 | NaN | 7 | 1 | 7 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1.00 | -953.50 | -506.00 | -1401.00 | 0.00 | 0.00 | 0.00 | 173.50 | 1307.00 | -1097.00 | 0.00 | 0.00 | 88823.25 | 355293.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | -506.00 | -1124.00 | Active | currency 1 | Credit card | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 30667.32 | 30667.32 | 30667.32 | 30667.32 | 467775.00 | 467775.00 | 467775.00 | 467775.00 | 467775.00 | 467775.00 | 467775.00 | 467775.00 | 467775.00 | 467775.00 | 467775.00 | 467775.00 | 15.00 | 15.00 | 15.00 | -236.00 | -236.00 | -236.00 | -236.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.00 | 1.00 | 0.00 | 0.00 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | 0.00 | 0.00 | 0.00 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | 0.00 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | 0.00 | XAP | Cash through the bank | Clothing and Accessories | Stone | POS industry with interest | -8.00 | -2.00 | -5.00 | 18.00 | 126.00 | 18.00 | 18.00 | 15.00 | 105.00 | 18.00 | 12.00 | 0.00 | 0.00 | 0.00 | 0.00 | 7.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | 0.00 | 0.00 | 3.00 | 140680.40 | 140680.40 | -39375.00 | -315000.00 | 0.00 | 0.00 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Uczenie |
| 4 | 319459 | 0 | Revolving loans | M | Y | Y | 0 | 405000.00 | 675000.00 | 33750.00 | 675000.00 | Unaccompanied | Commercial associate | Secondary / secondary special | Married | House / apartment | 0.07 | -20437 | -954 | -12796.00 | -1119 | 1 | 1 | 0 | 1 | 1 | 0 | Managers | 2 | 1 | 1 | TUESDAY | 11 | 0 | 0 | 0 | 0 | 0 | 0 | Business Entity Type 3 | 0.60 | 0.41 | 0 | 0 | 0 | 0 | -257 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2.00 | -257.00 | -257.00 | -257.00 | 0.00 | 0.00 | 0.00 | 111.00 | 111.00 | 111.00 | 0.00 | 0.00 | 514003.50 | 514003.50 | 200133.00 | 200133.00 | 0.00 | 0.00 | 0.00 | -2.00 | -2.00 | Active | currency 1 | Consumer credit | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 79722.40 | 106296.39 | 53148.42 | 159444.81 | 771009.75 | 1028011.50 | 514008.00 | 1542019.50 | 771009.75 | 1028011.50 | 514008.00 | 1542019.50 | 771009.75 | 1028011.50 | 514008.00 | 1542019.50 | 16.50 | 16.00 | 17.00 | -257.00 | -257.00 | -257.00 | -514.00 | 1.00 | 1.00 | 1.00 | 9.00 | 0.00 | 2.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 2.00 | 0.00 | 0.00 | 1.00 | 0.00 | 1.00 | 0.00 | 2.00 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | 0.00 | 2.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 2.00 | XAP | Cash through the bank | Furniture | Country-wide | POS industry with interest | -9.00 | -2.00 | -5.50 | 12.00 | 96.00 | 12.00 | 12.00 | 8.50 | 68.00 | 12.00 | 5.00 | 0.00 | 0.00 | 0.00 | 0.00 | 8.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | 0.00 | 0.00 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Uczenie |
df_all.tail(5)
| SK_ID_CURR | TARGET | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | AMT_GOODS_PRICE | NAME_TYPE_SUITE | NAME_INCOME_TYPE | NAME_EDUCATION_TYPE | NAME_FAMILY_STATUS | NAME_HOUSING_TYPE | REGION_POPULATION_RELATIVE | DAYS_BIRTH | DAYS_EMPLOYED | DAYS_REGISTRATION | DAYS_ID_PUBLISH | FLAG_MOBIL | FLAG_EMP_PHONE | FLAG_WORK_PHONE | FLAG_CONT_MOBILE | FLAG_PHONE | FLAG_EMAIL | OCCUPATION_TYPE | CNT_FAM_MEMBERS | REGION_RATING_CLIENT | REGION_RATING_CLIENT_W_CITY | WEEKDAY_APPR_PROCESS_START | HOUR_APPR_PROCESS_START | REG_REGION_NOT_LIVE_REGION | REG_REGION_NOT_WORK_REGION | LIVE_REGION_NOT_WORK_REGION | REG_CITY_NOT_LIVE_CITY | REG_CITY_NOT_WORK_CITY | LIVE_CITY_NOT_WORK_CITY | ORGANIZATION_TYPE | EXT_SOURCE_2 | EXT_SOURCE_3 | OBS_30_CNT_SOCIAL_CIRCLE | DEF_30_CNT_SOCIAL_CIRCLE | OBS_60_CNT_SOCIAL_CIRCLE | DEF_60_CNT_SOCIAL_CIRCLE | DAYS_LAST_PHONE_CHANGE | FLAG_DOCUMENT_2 | FLAG_DOCUMENT_3 | FLAG_DOCUMENT_4 | FLAG_DOCUMENT_5 | FLAG_DOCUMENT_6 | FLAG_DOCUMENT_7 | FLAG_DOCUMENT_8 | FLAG_DOCUMENT_9 | FLAG_DOCUMENT_10 | FLAG_DOCUMENT_11 | FLAG_DOCUMENT_12 | FLAG_DOCUMENT_13 | FLAG_DOCUMENT_14 | FLAG_DOCUMENT_15 | FLAG_DOCUMENT_16 | FLAG_DOCUMENT_17 | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_YEAR | BUREAU_DAYS_CREDIT_MEAN | BUREAU_DAYS_CREDIT_MAX | BUREAU_DAYS_CREDIT_MIN | BUREAU_CREDIT_DAY_OVERDUE_MEAN | BUREAU_CREDIT_DAY_OVERDUE_SUM | BUREAU_CREDIT_DAY_OVERDUE_MAX | BUREAU_DAYS_CREDIT_ENDDATE_MEAN | BUREAU_DAYS_CREDIT_ENDDATE_MAX | BUREAU_DAYS_CREDIT_ENDDATE_MIN | BUREAU_CNT_CREDIT_PROLONG_MEAN | BUREAU_CNT_CREDIT_PROLONG_SUM | BUREAU_AMT_CREDIT_SUM_MEAN | BUREAU_AMT_CREDIT_SUM_SUM | BUREAU_AMT_CREDIT_SUM_DEBT_MEAN | BUREAU_AMT_CREDIT_SUM_DEBT_SUM | BUREAU_AMT_CREDIT_SUM_LIMIT_SUM | BUREAU_AMT_CREDIT_SUM_OVERDUE_MEAN | BUREAU_AMT_CREDIT_SUM_OVERDUE_SUM | BUREAU_DAYS_CREDIT_UPDATE_MAX | BUREAU_DAYS_CREDIT_UPDATE_MIN | 
BUREAU_MOST_FREQ_CREDIT_ACTIVE | BUREAU_MOST_FREQ_CREDIT_CURRENCY | BUREAU_MOST_FREQ_CREDIT_TYPE | BUREAU_BALANCE_STATUS_0 | BUREAU_BALANCE_STATUS_1 | BUREAU_BALANCE_STATUS_2 | BUREAU_BALANCE_STATUS_3 | BUREAU_BALANCE_STATUS_4 | BUREAU_BALANCE_STATUS_5 | BUREAU_BALANCE_STATUS_C | BUREAU_BALANCE_STATUS_X | PREVIOUS_APPLICATION_AMT_ANNUITY_MEAN | PREVIOUS_APPLICATION_AMT_ANNUITY_MAX | PREVIOUS_APPLICATION_AMT_ANNUITY_MIN | PREVIOUS_APPLICATION_AMT_ANNUITY_SUM | PREVIOUS_APPLICATION_AMT_APPLICATION_MEAN | PREVIOUS_APPLICATION_AMT_APPLICATION_MAX | PREVIOUS_APPLICATION_AMT_APPLICATION_MIN | PREVIOUS_APPLICATION_AMT_APPLICATION_SUM | PREVIOUS_APPLICATION_AMT_CREDIT_MEAN | PREVIOUS_APPLICATION_AMT_CREDIT_MAX | PREVIOUS_APPLICATION_AMT_CREDIT_MIN | PREVIOUS_APPLICATION_AMT_CREDIT_SUM | PREVIOUS_APPLICATION_AMT_GOODS_PRICE_MEAN | PREVIOUS_APPLICATION_AMT_GOODS_PRICE_MAX | PREVIOUS_APPLICATION_AMT_GOODS_PRICE_MIN | PREVIOUS_APPLICATION_AMT_GOODS_PRICE_SUM | PREVIOUS_APPLICATION_HOUR_APPR_PROCESS_START_MEAN | PREVIOUS_APPLICATION_HOUR_APPR_PROCESS_START_MIN | PREVIOUS_APPLICATION_HOUR_APPR_PROCESS_START_MAX | PREVIOUS_APPLICATION_DAYS_DECISION_MEAN | PREVIOUS_APPLICATION_DAYS_DECISION_MAX | PREVIOUS_APPLICATION_DAYS_DECISION_MIN | PREVIOUS_APPLICATION_DAYS_DECISION_SUM | PREVIOUS_APPLICATION_RATIO_APP_TO_GET_MEAN | PREVIOUS_APPLICATION_RATIO_APP_TO_GET_MAX | PREVIOUS_APPLICATION_RATIO_APP_TO_GET_MIN | PREVIOUS_APPLICATION_PREV_APPS_COUNT | PREVIOUS_APPLICATION_NAME_CONTRACT_TYPE_CASH LOANS | PREVIOUS_APPLICATION_NAME_CONTRACT_TYPE_CONSUMER LOANS | PREVIOUS_APPLICATION_NAME_CONTRACT_TYPE_REVOLVING LOANS | PREVIOUS_APPLICATION_NAME_CONTRACT_TYPE_XNA | PREVIOUS_APPLICATION_WEEKDAY_APPR_PROCESS_START_FRIDAY | PREVIOUS_APPLICATION_WEEKDAY_APPR_PROCESS_START_MONDAY | PREVIOUS_APPLICATION_WEEKDAY_APPR_PROCESS_START_SATURDAY | PREVIOUS_APPLICATION_WEEKDAY_APPR_PROCESS_START_SUNDAY | PREVIOUS_APPLICATION_WEEKDAY_APPR_PROCESS_START_THURSDAY | 
PREVIOUS_APPLICATION_WEEKDAY_APPR_PROCESS_START_TUESDAY | PREVIOUS_APPLICATION_WEEKDAY_APPR_PROCESS_START_WEDNESDAY | PREVIOUS_APPLICATION_NAME_CONTRACT_STATUS_APPROVED | PREVIOUS_APPLICATION_NAME_CONTRACT_STATUS_CANCELED | PREVIOUS_APPLICATION_NAME_CONTRACT_STATUS_REFUSED | PREVIOUS_APPLICATION_NAME_CONTRACT_STATUS_UNUSED OFFER | PREVIOUS_APPLICATION_NAME_CLIENT_TYPE_NEW | PREVIOUS_APPLICATION_NAME_CLIENT_TYPE_REFRESHED | PREVIOUS_APPLICATION_NAME_CLIENT_TYPE_REPEATER | PREVIOUS_APPLICATION_NAME_CLIENT_TYPE_XNA | PREVIOUS_APPLICATION_CODE_REJECT_REASON_CLIENT | PREVIOUS_APPLICATION_CODE_REJECT_REASON_HC | PREVIOUS_APPLICATION_CODE_REJECT_REASON_LIMIT | PREVIOUS_APPLICATION_CODE_REJECT_REASON_SCO | PREVIOUS_APPLICATION_CODE_REJECT_REASON_SCOFR | PREVIOUS_APPLICATION_CODE_REJECT_REASON_SYSTEM | PREVIOUS_APPLICATION_CODE_REJECT_REASON_VERIF | PREVIOUS_APPLICATION_CODE_REJECT_REASON_XAP | PREVIOUS_APPLICATION_CODE_REJECT_REASON_XNA | PREVIOUS_APPLICATION_NAME_PRODUCT_TYPE_XNA | PREVIOUS_APPLICATION_NAME_PRODUCT_TYPE_WALK-IN | PREVIOUS_APPLICATION_NAME_PRODUCT_TYPE_X-SELL | PREVIOUS_APPLICATION_NAME_YIELD_GROUP_XNA | PREVIOUS_APPLICATION_NAME_YIELD_GROUP_HIGH | PREVIOUS_APPLICATION_NAME_YIELD_GROUP_LOW_ACTION | PREVIOUS_APPLICATION_NAME_YIELD_GROUP_LOW_NORMAL | PREVIOUS_APPLICATION_NAME_YIELD_GROUP_MIDDLE | PREVIOUS_APPLICATION_MOST_FREQ_PREVIOUS_APPLICATION_NAME_CASH_LOAN_PURPOSE | PREVIOUS_APPLICATION_MOST_FREQ_PREVIOUS_APPLICATION_NAME_PAYMENT_TYPE | PREVIOUS_APPLICATION_MOST_FREQ_PREVIOUS_APPLICATION_NAME_GOODS_CATEGORY | PREVIOUS_APPLICATION_MOST_FREQ_PREVIOUS_APPLICATION_CHANNEL_TYPE | PREVIOUS_APPLICATION_MOST_FREQ_PREVIOUS_APPLICATION_PRODUCT_COMBINATION | POS_CASH_MONTHS_BALANCE_MIN | POS_CASH_MONTHS_BALANCE_MAX | POS_CASH_MONTHS_BALANCE_MEAN | POS_CASH_CNT_INSTALMENT_MEAN | POS_CASH_CNT_INSTALMENT_SUM | POS_CASH_CNT_INSTALMENT_MAX | POS_CASH_CNT_INSTALMENT_MIN | POS_CASH_CNT_INSTALMENT_FUTURE_MEAN | POS_CASH_CNT_INSTALMENT_FUTURE_SUM | 
POS_CASH_CNT_INSTALMENT_FUTURE_MAX | POS_CASH_CNT_INSTALMENT_FUTURE_MIN | POS_CASH_SK_DPD_MAX | POS_CASH_SK_DPD_MEAN | POS_CASH_SK_DPD_DEF_MAX | POS_CASH_SK_DPD_DEF_MEAN | POS_CASH_NAME_CONTRACT_STATUS_ACTIVE | POS_CASH_NAME_CONTRACT_STATUS_AMORTIZED DEBT | POS_CASH_NAME_CONTRACT_STATUS_APPROVED | POS_CASH_NAME_CONTRACT_STATUS_CANCELED | POS_CASH_NAME_CONTRACT_STATUS_COMPLETED | POS_CASH_NAME_CONTRACT_STATUS_DEMAND | POS_CASH_NAME_CONTRACT_STATUS_RETURNED TO THE STORE | POS_CASH_NAME_CONTRACT_STATUS_SIGNED | POS_CASH_NAME_CONTRACT_STATUS_XNA | POS_CASH_APP_COUNT | INSTALLMENTS_PAYMENTS_IS_DELAYED_SUM | INSTALLMENTS_PAYMENTS_IS_DELAYED_MEAN | INSTALLMENTS_PAYMENTS_NUM_INSTALMENT_VERSION_NUNIQUE | INSTALLMENTS_PAYMENTS_PAYMENT_CHANGE_MEAN | INSTALLMENTS_PAYMENTS_PAYMENT_CHANGE_MAX | INSTALLMENTS_PAYMENTS_PAYMENT_UNDERPAYMENT_MEAN | INSTALLMENTS_PAYMENTS_PAYMENT_UNDERPAYMENT_SUM | INSTALLMENTS_PAYMENTS_PAYMENT_UNDERPAYMENT_MAX | INSTALLMENTS_PAYMENTS_PERCENTAGE_DELAYED | CREDIT_CARD_BALANCE_MONTHS_BALANCE_MIN | CREDIT_CARD_BALANCE_MONTHS_BALANCE_MAX | CREDIT_CARD_BALANCE_MONTHS_BALANCE_MEAN | CREDIT_CARD_BALANCE_AMT_BALANCE_SUM | CREDIT_CARD_BALANCE_AMT_BALANCE_MEAN | CREDIT_CARD_BALANCE_AMT_BALANCE_MAX | CREDIT_CARD_BALANCE_AMT_BALANCE_MIN | CREDIT_CARD_BALANCE_AMT_BALANCE_STD | CREDIT_CARD_BALANCE_AMT_CREDIT_LIMIT_ACTUAL_SUM | CREDIT_CARD_BALANCE_AMT_CREDIT_LIMIT_ACTUAL_MEAN | CREDIT_CARD_BALANCE_AMT_CREDIT_LIMIT_ACTUAL_MAX | CREDIT_CARD_BALANCE_AMT_CREDIT_LIMIT_ACTUAL_MIN | CREDIT_CARD_BALANCE_AMT_CREDIT_LIMIT_ACTUAL_STD | CREDIT_CARD_BALANCE_AMT_DRAWINGS_CURRENT_SUM | CREDIT_CARD_BALANCE_AMT_DRAWINGS_CURRENT_MEAN | CREDIT_CARD_BALANCE_AMT_DRAWINGS_CURRENT_MAX | CREDIT_CARD_BALANCE_AMT_DRAWINGS_CURRENT_MIN | CREDIT_CARD_BALANCE_AMT_DRAWINGS_CURRENT_STD | CREDIT_CARD_BALANCE_AMT_PAYMENT_TOTAL_CURRENT_SUM | CREDIT_CARD_BALANCE_AMT_PAYMENT_TOTAL_CURRENT_MEAN | CREDIT_CARD_BALANCE_AMT_PAYMENT_TOTAL_CURRENT_MAX | CREDIT_CARD_BALANCE_AMT_PAYMENT_TOTAL_CURRENT_MIN | 
CREDIT_CARD_BALANCE_AMT_TOTAL_RECEIVABLE_SUM | CREDIT_CARD_BALANCE_AMT_TOTAL_RECEIVABLE_MEAN | CREDIT_CARD_BALANCE_AMT_TOTAL_RECEIVABLE_MAX | CREDIT_CARD_BALANCE_AMT_TOTAL_RECEIVABLE_MIN | CREDIT_CARD_BALANCE_AMT_TOTAL_RECEIVABLE_STD | CREDIT_CARD_BALANCE_CNT_DRAWINGS_CURRENT_SUM | CREDIT_CARD_BALANCE_CNT_DRAWINGS_CURRENT_MEAN | CREDIT_CARD_BALANCE_CNT_DRAWINGS_CURRENT_MAX | CREDIT_CARD_BALANCE_CNT_DRAWINGS_CURRENT_MIN | CREDIT_CARD_BALANCE_CNT_INSTALMENT_MATURE_CUM_SUM | CREDIT_CARD_BALANCE_CNT_INSTALMENT_MATURE_CUM_MEAN | CREDIT_CARD_BALANCE_CNT_INSTALMENT_MATURE_CUM_MAX | CREDIT_CARD_BALANCE_CNT_INSTALMENT_MATURE_CUM_MIN | CREDIT_CARD_BALANCE_SK_DPD_MAX | CREDIT_CARD_BALANCE_SK_DPD_MEAN | CREDIT_CARD_BALANCE_SK_DPD_SUM | CREDIT_CARD_BALANCE_SK_DPD_DEF_MAX | CREDIT_CARD_BALANCE_SK_DPD_DEF_MEAN | CREDIT_CARD_BALANCE_SK_DPD_DEF_SUM | CREDIT_CARD_BALANCE_CREDIT_USE_RATE_MEAN | CREDIT_CARD_BALANCE_CREDIT_USE_RATE_MAX | CREDIT_CARD_BALANCE_STATUS_Active | CREDIT_CARD_BALANCE_STATUS_Approved | CREDIT_CARD_BALANCE_STATUS_Completed | CREDIT_CARD_BALANCE_STATUS_Demand | CREDIT_CARD_BALANCE_STATUS_Refused | CREDIT_CARD_BALANCE_STATUS_Sent proposal | CREDIT_CARD_BALANCE_STATUS_Signed | Próba | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 513892 | 456230 | 0 | Cash loans | F | Y | Y | 1 | 292500.00 | 355536.00 | 18283.50 | 270000.00 | Unaccompanied | Commercial associate | Higher education | Civil marriage | House / apartment | 0.07 | -16010 | -1185 | -5034.00 | -4392 | 1 | 1 | 0 | 1 | 1 | 0 | Brak Danych | 3 | 1 | 1 | SATURDAY | 17 | 0 | 0 | 0 | 0 | 0 | 0 | Business Entity Type 2 | 0.66 | 0.20 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1.00 | -1254.83 | -606.00 | -2102.00 | 1.00 | 12.00 | 12.00 | 2025.20 | 27336.00 | -1797.00 | 0.00 | 0.00 | 116386.88 | 1396642.50 | 25320.27 | 278523.00 | 0.00 | 17.62 | 211.50 | -3.00 | -1651.00 | Closed | currency 1 | Consumer credit | 150.00 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 186.00 | 48.00 | 35610.92 | 53109.32 | 21129.26 | 106832.74 | 278056.85 | 525730.50 | 123070.50 | 1390284.27 | 278764.96 | 525730.50 | 108522.00 | 1393824.78 | 278056.85 | 525730.50 | 123070.50 | 1390284.27 | 16.00 | 14.00 | 18.00 | -1458.20 | -76.00 | -2294.00 | -7291.00 | 1.01 | 1.13 | 0.94 | 5.00 | 0.00 | 5.00 | 0.00 | 0.00 | 0.00 | 2.00 | 0.00 | 2.00 | 0.00 | 0.00 | 1.00 | 3.00 | 0.00 | 0.00 | 2.00 | 0.00 | 1.00 | 4.00 | 0.00 | 2.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 3.00 | 0.00 | 5.00 | 0.00 | 0.00 | 2.00 | 1.00 | 0.00 | 1.00 | 1.00 | XAP | Cash through the bank | Computers | Country-wide | POS household with interest | -96.00 | -1.00 | -40.96 | 9.42 | 245.00 | 12.00 | 6.00 | 4.42 | 115.00 | 12.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 24.00 | 0.00 | 0.00 | 0.00 | 2.00 | 0.00 | 0.00 | 0.00 | 0.00 | 4.00 | 0.00 | 0.00 | 2.00 | 227.34 | 1563.94 | 0.00 | 0.00 | 0.00 | 0.00 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Test |
| 513893 | 456234 | 0 | Cash loans | M | N | Y | 0 | 81000.00 | 135000.00 | 9148.50 | 135000.00 | Unaccompanied | Commercial associate | Higher education | Single / not married | House / apartment | 0.01 | -9874 | -1928 | -9445.00 | -2557 | 1 | 1 | 0 | 1 | 1 | 0 | Laborers | 1 | 2 | 2 | SATURDAY | 12 | 0 | 0 | 0 | 0 | 0 | 0 | Business Entity Type 1 | 0.38 | 0.50 | 0 | 0 | 0 | 0 | -439 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1.00 | -756.60 | -21.00 | -1201.00 | 0.00 | 0.00 | 0.00 | -205.10 | 1250.00 | -1047.00 | 0.00 | 0.00 | 139395.96 | 1393959.60 | 2452.03 | 17164.22 | 385.79 | 0.00 | 0.00 | -5.00 | -1055.00 | Closed | currency 1 | Consumer credit | 17.00 | 17.00 | 0.00 | 0.00 | 0.00 | 0.00 | 87.00 | 10.00 | 6870.57 | 11959.88 | 3697.70 | 20611.71 | 42535.12 | 90000.00 | 23800.50 | 170140.50 | 41704.88 | 87682.50 | 23800.50 | 166819.50 | 42535.12 | 90000.00 | 23800.50 | 170140.50 | 15.00 | 13.00 | 18.00 | -870.25 | -311.00 | -1732.00 | -3481.00 | 1.03 | 1.21 | 0.87 | 3.00 | 0.00 | 4.00 | 0.00 | 0.00 | 0.00 | 2.00 | 0.00 | 0.00 | 0.00 | 0.00 | 2.00 | 2.00 | 0.00 | 1.00 | 1.00 | 1.00 | 0.00 | 3.00 | 0.00 | 1.00 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 2.00 | 0.00 | 4.00 | 0.00 | 0.00 | 1.00 | 3.00 | 0.00 | 0.00 | 0.00 | XAP | Cash through the bank | Mobile | Country-wide | POS mobile with interest | -57.00 | -28.00 | -43.64 | 6.50 | 91.00 | 7.00 | 5.00 | 3.43 | 48.00 | 7.00 | 0.00 | 16.00 | 1.14 | 16.00 | 1.14 | 12.00 | 0.00 | 0.00 | 0.00 | 2.00 | 0.00 | 0.00 | 0.00 | 0.00 | 2.00 | 5.00 | 0.36 | 3.00 | 7555.41 | 12081.19 | 353.87 | 4954.14 | 4741.56 | 19.23 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Test |
| 513894 | 456242 | 0 | Cash loans | M | Y | Y | 0 | 198000.00 | 1312110.00 | 52168.50 | 1125000.00 | Unaccompanied | Commercial associate | Secondary / secondary special | Married | House / apartment | 0.07 | -19102 | -3689 | -746.00 | -2650 | 1 | 1 | 0 | 1 | 0 | 0 | Laborers | 2 | 1 | 1 | MONDAY | 10 | 0 | 0 | 0 | 0 | 0 | 0 | Construction | 0.75 | 0.41 | 0 | 0 | 0 | 0 | -734 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2.00 | -491.00 | -491.00 | -491.00 | 0.00 | 0.00 | 0.00 | 509.00 | 509.00 | 509.00 | 0.00 | 0.00 | 198000.00 | 198000.00 | 186097.50 | 186097.50 | 0.00 | 0.00 | 0.00 | -5.00 | -5.00 | Active | currency 1 | Credit card | 17.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 11054.52 | 18058.05 | 2250.00 | 44218.08 | 99441.00 | 141426.00 | 45000.00 | 397764.00 | 106664.62 | 173380.50 | 45000.00 | 426658.50 | 99441.00 | 141426.00 | 45000.00 | 397764.00 | 14.25 | 13.00 | 15.00 | -447.00 | -245.00 | -734.00 | -1788.00 | 0.95 | 1.07 | 0.82 | 18.00 | 0.00 | 3.00 | 1.00 | 0.00 | 0.00 | 2.00 | 0.00 | 0.00 | 1.00 | 1.00 | 0.00 | 4.00 | 0.00 | 0.00 | 0.00 | 1.00 | 0.00 | 3.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 4.00 | 0.00 | 3.00 | 1.00 | 0.00 | 1.00 | 0.00 | 0.00 | 1.00 | 2.00 | XAP | XNA | Audio/Video | Country-wide | POS household with interest | -24.00 | -1.00 | -13.25 | 10.81 | 346.00 | 12.00 | 10.00 | 5.72 | 183.00 | 12.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 30.00 | 0.00 | 0.00 | 0.00 | 2.00 | 0.00 | 0.00 | 0.00 | 0.00 | 3.00 | 1.00 | 0.02 | 2.00 | 3279.51 | 7737.54 | 0.00 | 0.00 | 0.00 | 0.59 | -8.00 | -1.00 | -4.50 | 1185858.63 | 148232.33 | 217274.80 | 14593.18 | 78923.49 | 1620000.00 | 202500.00 | 225000.00 | 45000.00 | 63639.61 | 318798.27 | 39849.78 | 98550.00 | 0.00 | 32827.58 | 156641.27 | 19580.16 | 45212.26 | 0.00 | 1182060.63 | 147757.58 | 216901.48 | 13702.18 | 79126.32 | 31.00 | 3.88 | 9.00 | 0.00 | 28.00 | 3.50 | 7.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.69 | 
0.97 | 8.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | Test |
| 513895 | 456251 | 0 | Cash loans | M | N | N | 0 | 157500.00 | 254700.00 | 27558.00 | 225000.00 | Unaccompanied | Working | Secondary / secondary special | Separated | With parents | 0.03 | -9327 | -236 | -8456.00 | -1982 | 1 | 1 | 0 | 1 | 0 | 0 | Sales staff | 1 | 1 | 1 | THURSDAY | 15 | 0 | 0 | 0 | 0 | 0 | 0 | Services | 0.68 | NaN | 0 | 0 | 0 | 0 | -273 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Brak danych | Brak danych | Brak danych | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 6605.91 | 6605.91 | 6605.91 | 6605.91 | 40455.00 | 40455.00 | 40455.00 | 40455.00 | 40455.00 | 40455.00 | 40455.00 | 40455.00 | 40455.00 | 40455.00 | 40455.00 | 40455.00 | 17.00 | 17.00 | 17.00 | -273.00 | -273.00 | -273.00 | -273.00 | 1.00 | 1.00 | 1.00 | 12.00 | 0.00 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | 0.00 | 0.00 | 1.00 | 0.00 | 0.00 | 0.00 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | 0.00 | 1.00 | 0.00 | 0.00 | 0.00 | 1.00 | 0.00 | 0.00 | 0.00 | XAP | Cash through the bank | Mobile | Country-wide | POS mobile with interest | -9.00 | -1.00 | -5.00 | 7.88 | 63.00 | 8.00 | 7.00 | 4.38 | 35.00 | 8.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 7.00 | 0.00 | 0.00 | 0.00 | 1.00 | 0.00 | 0.00 | 1.00 | 0.00 | 1.00 | 0.00 | 0.00 | 2.00 | 2346.82 | 2346.82 | 0.00 | 0.00 | 0.00 | 0.00 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Test |
| 513896 | 456254 | 1 | Cash loans | F | N | Y | 0 | 171000.00 | 370107.00 | 20205.00 | 319500.00 | Unaccompanied | Commercial associate | Secondary / secondary special | Married | House / apartment | 0.01 | -11961 | -4786 | -2562.00 | -931 | 1 | 1 | 0 | 1 | 0 | 0 | Laborers | 2 | 2 | 2 | WEDNESDAY | 9 | 0 | 0 | 0 | 1 | 1 | 0 | Business Entity Type 1 | 0.51 | 0.66 | 0 | 0 | 0 | 0 | -322 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.00 | -1104.00 | -1104.00 | -1104.00 | 0.00 | 0.00 | 0.00 | -859.00 | -859.00 | -859.00 | 0.00 | 0.00 | 45000.00 | 45000.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | -401.00 | -401.00 | Closed | currency 1 | Consumer credit | 8.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 29.00 | 0.00 | 10681.13 | 19065.83 | 2296.44 | 21362.26 | 121317.75 | 223789.50 | 18846.00 | 242635.50 | 134439.75 | 247423.50 | 21456.00 | 268879.50 | 121317.75 | 223789.50 | 18846.00 | 242635.50 | 15.00 | 12.00 | 18.00 | -299.50 | -277.00 | -322.00 | -599.00 | 0.89 | 0.90 | 0.88 | 3.00 | 0.00 | 2.00 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | 0.00 | 0.00 | 0.00 | 1.00 | 2.00 | 0.00 | 0.00 | 0.00 | 1.00 | 0.00 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 2.00 | 0.00 | 2.00 | 0.00 | 0.00 | 0.00 | 1.00 | 0.00 | 1.00 | 0.00 | XAP | Cash through the bank | Computers | Country-wide | POS household with interest | -11.00 | -1.00 | -5.55 | 14.90 | 298.00 | 16.00 | 14.00 | 10.35 | 207.00 | 16.00 | 4.00 | 0.00 | 0.00 | 0.00 | 0.00 | 20.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 2.00 | 0.00 | 0.00 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Test |
# Split into two sets: training ('Uczenie') and test
df_train = df_all[df_all['Próba'] == 'Uczenie']
df_test = df_all[df_all['Próba'] == 'Test']
# Drop the 'Próba' column from both sets
df_train = df_train.drop(columns=['Próba'])
df_test = df_test.drop(columns=['Próba'])
Checking the df_train and df_test tables before further modeling¶
Training set¶
df_train.shape
(452394, 251)
# Count the occurrences of each target value and its percentage share in the dataset
value_counts = df_train['TARGET'].value_counts()
percentage = (df_train['TARGET'].value_counts() / len(df_train)) * 100
# Build the summary DataFrame
summary_table = pd.DataFrame({
'Value': value_counts.index,
'Count': value_counts.values,
'Percentage (%)': percentage.values
})
print(summary_table)
   Value   Count  Percentage (%)
0      0  226197           50.00
1      1  226197           50.00
The training set has been balanced by oversampling: both classes now contain 226,197 observations.
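The balancing step itself is not shown in this section, so the following is only a minimal sketch of random oversampling, assuming the minority class is resampled with replacement until it matches the majority class. The function name `oversample_minority`, the seed and the toy data are hypothetical and are not taken from the notebook.

```python
import pandas as pd


def oversample_minority(df: pd.DataFrame, target: str = "TARGET", seed: int = 42) -> pd.DataFrame:
    """Duplicate randomly chosen minority-class rows until both classes are equal in size."""
    counts = df[target].value_counts()
    minority_label = counts.idxmin()
    n_missing = int(counts.max() - counts.min())

    # Draw the missing rows from the minority class, with replacement.
    minority_rows = df[df[target] == minority_label]
    extra = minority_rows.sample(n=n_missing, replace=True, random_state=seed)

    # Append the duplicates and shuffle so the classes are interleaved.
    balanced = pd.concat([df, extra], ignore_index=True)
    return balanced.sample(frac=1, random_state=seed).reset_index(drop=True)


# Toy data (hypothetical): 8 observations of class 0 and 2 of class 1.
toy = pd.DataFrame({"x": range(10), "TARGET": [0] * 8 + [1] * 2})
balanced_toy = oversample_minority(toy)
print(balanced_toy["TARGET"].value_counts())
```

Oversampling should be applied to the training data only, after the split, so that duplicated rows cannot leak into the test set.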
Test set¶
df_test.shape
(61503, 251)
# Count the occurrences of each target value and its percentage share in the dataset
value_counts = df_test['TARGET'].value_counts()
percentage = (df_test['TARGET'].value_counts() / len(df_test)) * 100
# Build the summary DataFrame
summary_table = pd.DataFrame({
'Value': value_counts.index,
'Count': value_counts.values,
'Percentage (%)': percentage.values
})
print(summary_table)
   Value  Count  Percentage (%)
0      0  56489           91.85
1      1   5014            8.15
The test set, by contrast, is left unbalanced so that model results stay reliable, that is, they are measured on the original class distribution.
# Export the training table to CSV so that it can be loaded into Statistica
# df_train.to_csv('df_train1.csv', index=False)
Reducing the number of variables - rejecting weak variables¶
The input dataset contains as many as 249 explanatory variables. That is far too many, so I first reduce their number with so-called filter methods, a group of techniques that are fairly simple yet effective. They include computing the Information Value (IV) and checking the correlation between variables: variables that are too strongly correlated may be interdependent. The group also contains other methods, such as a variance threshold, the chi-square test, the Fisher score and many more, but for this project I rely mainly on the first two.
Predictor selection - Statistica¶
The first stage of reducing the number of variables uses the scoring package available in Statistica. The package offers two ways to reduce the number of variables: predictor selection and representative selection. I start with predictor selection.
The Predictor selection - Predictor ranking option ranks the variables by Information Value, Cramér's V and Gini, and then restricts the dataset to the variables that significantly affect the studied phenomenon.
This is how the feature looks in Statistica:
The Association measures area lets you choose among three measures of predictive power: Cramér's V, IV (Information Value) and the Gini index.
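For orientation, two of these measures can be approximated in Python. The sketch below assumes the textbook definitions: Cramér's V computed from the chi-square statistic of a contingency table (without continuity correction), and Gini as 2·AUC − 1. Statistica's exact implementation is not documented here and may differ. The function names and toy data are hypothetical.

```python
import numpy as np
import pandas as pd


def cramers_v(x: pd.Series, y: pd.Series) -> float:
    """Cramér's V from the chi-square statistic of the x-by-y contingency table."""
    table = pd.crosstab(x, y).to_numpy(dtype=float)
    n = table.sum()
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / n
    chi2 = ((table - expected) ** 2 / expected).sum()
    min_dim = min(table.shape) - 1
    if min_dim == 0:  # a constant variable carries no association
        return 0.0
    return float(np.sqrt(chi2 / (n * min_dim)))


def gini_coefficient(score: pd.Series, target: pd.Series) -> float:
    """Gini = 2 * AUC - 1, with AUC obtained from the rank-sum (Mann-Whitney) formula."""
    ranks = score.rank(method="average")
    n_pos = int((target == 1).sum())
    n_neg = int((target == 0).sum())
    rank_sum_pos = ranks[target == 1].sum()
    auc = (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
    return float(2 * auc - 1)


# Toy data (hypothetical): both predictors separate the two classes perfectly.
category = pd.Series(["a", "a", "b", "b"])
score = pd.Series([0.1, 0.2, 0.8, 0.9])
target = pd.Series([0, 0, 1, 1])

v_toy = cramers_v(category, target)
gini_toy = gini_coefficient(score, target)
print(v_toy, gini_toy)
```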
The Atypical value area lets the calculation take missing data into account, or treat a user-specified value as atypical. Choosing the Missing data option treats missing values as a separate category in the analysis, so they affect the computed predictive power of the variable.
The Number of intervals field in the Discretization of quantitative predictors area sets the number of equal-frequency intervals into which each quantitative variable is divided before the association measures are computed. If Specify value or Missing data was selected in the Atypical value area, those values form an additional category.
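The discretization described above can be mirrored in pandas to compute an Information Value by hand. This is a rough sketch rather than a reproduction of Statistica's algorithm: the bin count of 10, the smoothing constant that guards against empty cells, the function name `information_value` and the synthetic data are all assumptions.

```python
import numpy as np
import pandas as pd


def information_value(feature: pd.Series, target: pd.Series,
                      n_bins: int = 10, smoothing: float = 0.5) -> float:
    """IV of a numeric feature: equal-frequency bins plus a separate bin for missing values."""
    # Equal-frequency binning; missing values stay NaN and then get their own bin (-1).
    bins = pd.qcut(feature, q=n_bins, labels=False, duplicates="drop").fillna(-1)

    table = pd.crosstab(bins, target).reindex(columns=[0, 1], fill_value=0)
    good = table[0] + smoothing  # smoothing avoids log(0) in empty cells
    bad = table[1] + smoothing

    dist_good = good / good.sum()
    dist_bad = bad / bad.sum()
    woe = np.log(dist_good / dist_bad)  # weight of evidence per bin
    return float(((dist_good - dist_bad) * woe).sum())


# Synthetic data (hypothetical): one informative feature with gaps, one pure-noise feature.
rng = np.random.default_rng(0)
n = 1000
target = pd.Series(rng.integers(0, 2, n))
strong = pd.Series(target + rng.normal(0, 0.5, n))
strong.iloc[:50] = np.nan
noise = pd.Series(rng.normal(size=n))

iv_strong = information_value(strong, target)
iv_noise = information_value(noise, target)
print(f"IV strong: {iv_strong:.2f}, IV noise: {iv_noise:.2f}")
```

The informative feature should receive a clearly higher IV than the noise feature, which is the property the predictor ranking relies on.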
In the Variables field I selected the qualitative dependent variable and the qualitative and quantitative predictors.
I then obtained the ranking:
The dataset with the measures is loaded below.
ranking_predyktorow = pd.read_excel('ranking predyktorow.xlsx', engine='openpyxl')
ranking_predyktorow
| | Zmienna | IV | V Cramera | Gini |
|---|---|---|---|---|
| 0 | EXT_SOURCE_3 | 0.33 | 0.28 | 0.25 |
| 1 | EXT_SOURCE_2 | 0.32 | 0.27 | 0.31 |
| 2 | BUREAU_DAYS_CREDIT_MEAN | 0.12 | 0.17 | 0.10 |
| 3 | DAYS_BIRTH | 0.09 | 0.15 | 0.17 |
| 4 | AMT_GOODS_PRICE | 0.09 | 0.15 | 0.07 |
| 5 | OCCUPATION_TYPE | 0.09 | 0.15 | 0.07 |
| 6 | DAYS_EMPLOYED | 0.08 | 0.14 | 0.07 |
| 7 | BUREAU_DAYS_CREDIT_MAX | 0.08 | 0.14 | 0.07 |
| 8 | BUREAU_DAYS_CREDIT_MIN | 0.08 | 0.14 | 0.07 |
| 9 | ORGANIZATION_TYPE | 0.08 | 0.14 | 0.05 |
| 10 | BUREAU_MOST_FREQ_CREDIT_ACTIVE | 0.07 | 0.13 | 0.13 |
| 11 | BUREAU_DAYS_CREDIT_ENDDATE_MEAN | 0.07 | 0.13 | 0.06 |
| 12 | INSTALLMENTS_PAYMENTS_IS_DELAYED_MEAN | 0.06 | 0.12 | 0.13 |
| 13 | PREVIOUS_APPLICATION_NAME_CONTRACT_STATUS_REFUSED | 0.06 | 0.12 | 0.12 |
| 14 | PREVIOUS_APPLICATION_NAME_PRODUCT_TYPE_WALK-IN | 0.06 | 0.12 | 0.11 |
| 15 | NAME_INCOME_TYPE | 0.06 | 0.12 | 0.10 |
| 16 | BUREAU_DAYS_CREDIT_UPDATE_MIN | 0.06 | 0.12 | 0.06 |
| 17 | BUREAU_DAYS_CREDIT_ENDDATE_MIN | 0.06 | 0.12 | 0.05 |
| 18 | CREDIT_CARD_BALANCE_CNT_DRAWINGS_CURRENT_MEAN | 0.06 | 0.12 | 0.04 |
| 19 | CREDIT_CARD_BALANCE_CREDIT_USE_RATE_MEAN | 0.06 | 0.12 | 0.04 |
| 20 | CREDIT_CARD_BALANCE_CNT_INSTALMENT_MATURE_CUM_SUM | 0.06 | 0.12 | 0.02 |
| 21 | BUREAU_AMT_CREDIT_SUM_DEBT_MEAN | 0.05 | 0.12 | 0.04 |
| 22 | DAYS_LAST_PHONE_CHANGE | 0.05 | 0.11 | 0.12 |
| 23 | PREVIOUS_APPLICATION_DAYS_DECISION_MEAN | 0.05 | 0.11 | 0.12 |
| 24 | INSTALLMENTS_PAYMENTS_PAYMENT_UNDERPAYMENT_MEAN | 0.05 | 0.11 | 0.11 |
| 25 | REGION_RATING_CLIENT_W_CITY | 0.05 | 0.11 | 0.10 |
| 26 | REGION_RATING_CLIENT | 0.05 | 0.11 | 0.09 |
| 27 | NAME_EDUCATION_TYPE | 0.05 | 0.11 | 0.08 |
| 28 | BUREAU_DAYS_CREDIT_UPDATE_MAX | 0.05 | 0.11 | 0.04 |
| 29 | CREDIT_CARD_BALANCE_AMT_DRAWINGS_CURRENT_MEAN | 0.05 | 0.11 | 0.04 |
| 30 | CREDIT_CARD_BALANCE_CNT_DRAWINGS_CURRENT_MAX | 0.05 | 0.11 | 0.04 |
| 31 | BUREAU_AMT_CREDIT_SUM_DEBT_SUM | 0.05 | 0.11 | 0.03 |
| 32 | CREDIT_CARD_BALANCE_CNT_INSTALMENT_MATURE_CUM_MEAN | 0.05 | 0.11 | 0.02 |
| 33 | PREVIOUS_APPLICATION_MOST_FREQ_PREVIOUS_APPLICATION_PRODUCT_COMBINATION | 0.05 | 0.11 | 0.02 |
| 34 | PREVIOUS_APPLICATION_DAYS_DECISION_MIN | 0.04 | 0.10 | 0.12 |
| 35 | POS_CASH_MONTHS_BALANCE_MIN | 0.04 | 0.10 | 0.12 |
| 36 | DAYS_ID_PUBLISH | 0.04 | 0.10 | 0.11 |
| 37 | INSTALLMENTS_PAYMENTS_IS_DELAYED_SUM | 0.04 | 0.10 | 0.10 |
| 38 | INSTALLMENTS_PAYMENTS_PAYMENT_UNDERPAYMENT_SUM | 0.04 | 0.10 | 0.10 |
| 39 | INSTALLMENTS_PAYMENTS_PERCENTAGE_DELAYED | 0.04 | 0.10 | 0.10 |
| 40 | CODE_GENDER | 0.04 | 0.10 | 0.10 |
| 41 | PREVIOUS_APPLICATION_NAME_CLIENT_TYPE_NEW | 0.04 | 0.10 | 0.09 |
| 42 | PREVIOUS_APPLICATION_RATIO_APP_TO_GET_MAX | 0.04 | 0.10 | 0.08 |
| 43 | AMT_CREDIT | 0.04 | 0.10 | 0.04 |
| 44 | CREDIT_CARD_BALANCE_AMT_BALANCE_MEAN | 0.04 | 0.10 | 0.04 |
| 45 | CREDIT_CARD_BALANCE_AMT_DRAWINGS_CURRENT_STD | 0.04 | 0.10 | 0.04 |
| 46 | CREDIT_CARD_BALANCE_AMT_TOTAL_RECEIVABLE_MEAN | 0.04 | 0.10 | 0.04 |
| 47 | CREDIT_CARD_BALANCE_CNT_DRAWINGS_CURRENT_SUM | 0.04 | 0.10 | 0.04 |
| 48 | CREDIT_CARD_BALANCE_CREDIT_USE_RATE_MAX | 0.04 | 0.10 | 0.04 |
| 49 | CREDIT_CARD_BALANCE_AMT_PAYMENT_TOTAL_CURRENT_SUM | 0.04 | 0.10 | 0.03 |
| 50 | CREDIT_CARD_BALANCE_CNT_INSTALMENT_MATURE_CUM_MAX | 0.04 | 0.10 | 0.03 |
| 51 | BUREAU_DAYS_CREDIT_ENDDATE_MAX | 0.04 | 0.10 | 0.02 |
| 52 | BUREAU_AMT_CREDIT_SUM_LIMIT_SUM | 0.04 | 0.10 | 0.01 |
| 53 | PREVIOUS_APPLICATION_CODE_REJECT_REASON_HC | 0.04 | 0.09 | 0.09 |
| 54 | PREVIOUS_APPLICATION_CODE_REJECT_REASON_SCOFR | 0.04 | 0.09 | 0.05 |
| 55 | POS_CASH_MONTHS_BALANCE_MEAN | 0.03 | 0.09 | 0.09 |
| 56 | INSTALLMENTS_PAYMENTS_PAYMENT_UNDERPAYMENT_MAX | 0.03 | 0.09 | 0.09 |
| 57 | REG_CITY_NOT_WORK_CITY | 0.03 | 0.09 | 0.08 |
| 58 | PREVIOUS_APPLICATION_NAME_CONTRACT_TYPE_REVOLVING LOANS | 0.03 | 0.09 | 0.08 |
| 59 | FLAG_EMP_PHONE | 0.03 | 0.09 | 0.07 |
| 60 | POS_CASH_APP_COUNT | 0.03 | 0.09 | 0.07 |
| 61 | PREVIOUS_APPLICATION_AMT_ANNUITY_MIN | 0.03 | 0.09 | 0.06 |
| 62 | PREVIOUS_APPLICATION_RATIO_APP_TO_GET_MEAN | 0.03 | 0.09 | 0.06 |
| 63 | PREVIOUS_APPLICATION_CODE_REJECT_REASON_LIMIT | 0.03 | 0.09 | 0.06 |
| 64 | PREVIOUS_APPLICATION_MOST_FREQ_PREVIOUS_APPLICATION_NAME_GOODS_CATEGORY | 0.03 | 0.09 | 0.06 |
| 65 | CREDIT_CARD_BALANCE_AMT_BALANCE_MAX | 0.03 | 0.09 | 0.04 |
| 66 | CREDIT_CARD_BALANCE_AMT_BALANCE_STD | 0.03 | 0.09 | 0.04 |
| 67 | CREDIT_CARD_BALANCE_AMT_DRAWINGS_CURRENT_MAX | 0.03 | 0.09 | 0.04 |
| 68 | CREDIT_CARD_BALANCE_AMT_TOTAL_RECEIVABLE_MAX | 0.03 | 0.09 | 0.04 |
| 69 | CREDIT_CARD_BALANCE_AMT_BALANCE_SUM | 0.03 | 0.09 | 0.03 |
| 70 | CREDIT_CARD_BALANCE_AMT_BALANCE_MIN | 0.03 | 0.09 | 0.03 |
| 71 | CREDIT_CARD_BALANCE_AMT_DRAWINGS_CURRENT_SUM | 0.03 | 0.09 | 0.03 |
| 72 | CREDIT_CARD_BALANCE_AMT_PAYMENT_TOTAL_CURRENT_MAX | 0.03 | 0.09 | 0.03 |
| 73 | CREDIT_CARD_BALANCE_AMT_TOTAL_RECEIVABLE_SUM | 0.03 | 0.09 | 0.03 |
| 74 | CREDIT_CARD_BALANCE_AMT_TOTAL_RECEIVABLE_MIN | 0.03 | 0.09 | 0.03 |
| 75 | PREVIOUS_APPLICATION_NAME_YIELD_GROUP_HIGH | 0.03 | 0.08 | 0.09 |
| 76 | DAYS_REGISTRATION | 0.03 | 0.08 | 0.08 |
| 77 | PREVIOUS_APPLICATION_NAME_YIELD_GROUP_XNA | 0.03 | 0.08 | 0.08 |
| 78 | FLAG_DOCUMENT_3 | 0.03 | 0.08 | 0.07 |
| 79 | POS_CASH_NAME_CONTRACT_STATUS_ACTIVE | 0.03 | 0.08 | 0.07 |
| 80 | REGION_POPULATION_RELATIVE | 0.03 | 0.08 | 0.06 |
| 81 | POS_CASH_SK_DPD_DEF_MEAN | 0.03 | 0.08 | 0.06 |
| 82 | PREVIOUS_APPLICATION_NAME_YIELD_GROUP_LOW_NORMAL | 0.03 | 0.08 | 0.05 |
| 83 | BUREAU_MOST_FREQ_CREDIT_TYPE | 0.03 | 0.08 | 0.05 |
| 84 | CREDIT_CARD_BALANCE_AMT_TOTAL_RECEIVABLE_STD | 0.03 | 0.08 | 0.04 |
| 85 | CREDIT_CARD_BALANCE_AMT_PAYMENT_TOTAL_CURRENT_MEAN | 0.03 | 0.08 | 0.03 |
| 86 | AMT_ANNUITY | 0.03 | 0.08 | 0.00 |
| 87 | BUREAU_AMT_CREDIT_SUM_MEAN | 0.02 | 0.08 | 0.07 |
| 88 | POS_CASH_SK_DPD_DEF_MAX | 0.02 | 0.08 | 0.06 |
| 89 | REG_CITY_NOT_LIVE_CITY | 0.02 | 0.08 | 0.05 |
| 90 | PREVIOUS_APPLICATION_HOUR_APPR_PROCESS_START_MEAN | 0.02 | 0.08 | 0.05 |
| 91 | PREVIOUS_APPLICATION_HOUR_APPR_PROCESS_START_MIN | 0.02 | 0.08 | 0.05 |
| 92 | PREVIOUS_APPLICATION_RATIO_APP_TO_GET_MIN | 0.02 | 0.08 | 0.03 |
| 93 | INSTALLMENTS_PAYMENTS_PAYMENT_CHANGE_MAX | 0.02 | 0.08 | 0.02 |
| 94 | BUREAU_AMT_CREDIT_SUM_SUM | 0.02 | 0.07 | 0.07 |
| 95 | POS_CASH_SK_DPD_MAX | 0.02 | 0.07 | 0.06 |
| 96 | POS_CASH_SK_DPD_MEAN | 0.02 | 0.07 | 0.06 |
| 97 | PREVIOUS_APPLICATION_MOST_FREQ_PREVIOUS_APPLICATION_NAME_PAYMENT_TYPE | 0.02 | 0.07 | 0.06 |
| 98 | BUREAU_BALANCE_STATUS_C | 0.02 | 0.07 | 0.05 |
| 99 | POS_CASH_CNT_INSTALMENT_FUTURE_MEAN | 0.02 | 0.07 | 0.05 |
| 100 | POS_CASH_NAME_CONTRACT_STATUS_COMPLETED | 0.02 | 0.07 | 0.05 |
| 101 | PREVIOUS_APPLICATION_AMT_ANNUITY_MEAN | 0.02 | 0.07 | 0.04 |
| 102 | PREVIOUS_APPLICATION_AMT_GOODS_PRICE_MIN | 0.02 | 0.07 | 0.04 |
| 103 | PREVIOUS_APPLICATION_HOUR_APPR_PROCESS_START_MAX | 0.02 | 0.07 | 0.04 |
| 104 | PREVIOUS_APPLICATION_NAME_CONTRACT_STATUS_APPROVED | 0.02 | 0.07 | 0.04 |
| 105 | NAME_FAMILY_STATUS | 0.02 | 0.07 | 0.04 |
| 106 | BUREAU_CREDIT_DAY_OVERDUE_MEAN | 0.02 | 0.07 | 0.03 |
| 107 | BUREAU_CREDIT_DAY_OVERDUE_SUM | 0.02 | 0.07 | 0.03 |
| 108 | BUREAU_CREDIT_DAY_OVERDUE_MAX | 0.02 | 0.07 | 0.03 |
| 109 | BUREAU_AMT_CREDIT_SUM_OVERDUE_MEAN | 0.02 | 0.07 | 0.03 |
| 110 | BUREAU_AMT_CREDIT_SUM_OVERDUE_SUM | 0.02 | 0.07 | 0.03 |
| 111 | PREVIOUS_APPLICATION_AMT_APPLICATION_MEAN | 0.02 | 0.07 | 0.03 |
| 112 | PREVIOUS_APPLICATION_NAME_CLIENT_TYPE_REFRESHED | 0.02 | 0.07 | 0.03 |
| 113 | PREVIOUS_APPLICATION_NAME_YIELD_GROUP_LOW_ACTION | 0.02 | 0.07 | 0.03 |
| 114 | CREDIT_CARD_BALANCE_MONTHS_BALANCE_MIN | 0.02 | 0.07 | 0.03 |
| 115 | CREDIT_CARD_BALANCE_MONTHS_BALANCE_MEAN | 0.02 | 0.07 | 0.03 |
| 116 | AMT_REQ_CREDIT_BUREAU_YEAR | 0.02 | 0.07 | 0.01 |
| 117 | BUREAU_BALANCE_STATUS_1 | 0.02 | 0.07 | 0.01 |
| 118 | PREVIOUS_APPLICATION_AMT_APPLICATION_MAX | 0.02 | 0.07 | 0.01 |
| 119 | PREVIOUS_APPLICATION_AMT_GOODS_PRICE_MAX | 0.02 | 0.07 | 0.01 |
| 120 | CREDIT_CARD_BALANCE_STATUS_Active | 0.02 | 0.07 | 0.01 |
| 121 | PREVIOUS_APPLICATION_AMT_CREDIT_MAX | 0.02 | 0.07 | 0.00 |
| 122 | PREVIOUS_APPLICATION_MOST_FREQ_PREVIOUS_APPLICATION_CHANNEL_TYPE | 0.02 | 0.07 | 0.00 |
| 123 | PREVIOUS_APPLICATION_DAYS_DECISION_SUM | 0.02 | 0.06 | 0.07 |
| 124 | POS_CASH_CNT_INSTALMENT_MEAN | 0.02 | 0.06 | 0.04 |
| 125 | INSTALLMENTS_PAYMENTS_PAYMENT_CHANGE_MEAN | 0.02 | 0.06 | 0.04 |
| 126 | BUREAU_BALANCE_STATUS_0 | 0.02 | 0.06 | 0.03 |
| 127 | BUREAU_BALANCE_STATUS_X | 0.02 | 0.06 | 0.03 |
| 128 | PREVIOUS_APPLICATION_AMT_ANNUITY_MAX | 0.02 | 0.06 | 0.03 |
| 129 | PREVIOUS_APPLICATION_AMT_CREDIT_MIN | 0.02 | 0.06 | 0.03 |
| 130 | PREVIOUS_APPLICATION_AMT_APPLICATION_MIN | 0.02 | 0.06 | 0.02 |
| 131 | PREVIOUS_APPLICATION_AMT_GOODS_PRICE_MEAN | 0.02 | 0.06 | 0.02 |
| 132 | PREVIOUS_APPLICATION_DAYS_DECISION_MAX | 0.01 | 0.06 | 0.06 |
| 133 | LIVE_CITY_NOT_WORK_CITY | 0.01 | 0.06 | 0.05 |
| 134 | POS_CASH_CNT_INSTALMENT_MIN | 0.01 | 0.06 | 0.05 |
| 135 | DEF_30_CNT_SOCIAL_CIRCLE | 0.01 | 0.06 | 0.04 |
| 136 | BUREAU_BALANCE_STATUS_3 | 0.01 | 0.06 | 0.04 |
| 137 | BUREAU_BALANCE_STATUS_4 | 0.01 | 0.06 | 0.04 |
| 138 | BUREAU_BALANCE_STATUS_5 | 0.01 | 0.06 | 0.04 |
| 139 | FLAG_DOCUMENT_6 | 0.01 | 0.06 | 0.03 |
| 140 | BUREAU_BALANCE_STATUS_2 | 0.01 | 0.06 | 0.03 |
| 141 | PREVIOUS_APPLICATION_NAME_CONTRACT_TYPE_CONSUMER LOANS | 0.01 | 0.06 | 0.03 |
| 142 | NAME_CONTRACT_TYPE | 0.01 | 0.06 | 0.03 |
| 143 | NAME_HOUSING_TYPE | 0.01 | 0.06 | 0.03 |
| 144 | PREVIOUS_APPLICATION_AMT_ANNUITY_SUM | 0.01 | 0.06 | 0.02 |
| 145 | PREVIOUS_APPLICATION_AMT_CREDIT_MEAN | 0.01 | 0.06 | 0.02 |
| 146 | POS_CASH_CNT_INSTALMENT_MAX | 0.01 | 0.06 | 0.02 |
| 147 | POS_CASH_CNT_INSTALMENT_FUTURE_MAX | 0.01 | 0.06 | 0.02 |
| 148 | INSTALLMENTS_PAYMENTS_NUM_INSTALMENT_VERSION_NUNIQUE | 0.01 | 0.06 | 0.02 |
| 149 | CREDIT_CARD_BALANCE_CNT_INSTALMENT_MATURE_CUM_MIN | 0.01 | 0.06 | 0.02 |
| 150 | PREVIOUS_APPLICATION_AMT_CREDIT_SUM | 0.01 | 0.06 | 0.01 |
| 151 | CREDIT_CARD_BALANCE_AMT_CREDIT_LIMIT_ACTUAL_SUM | 0.01 | 0.06 | 0.01 |
| 152 | PREVIOUS_APPLICATION_AMT_APPLICATION_SUM | 0.01 | 0.06 | 0.00 |
| 153 | PREVIOUS_APPLICATION_AMT_GOODS_PRICE_SUM | 0.01 | 0.06 | 0.00 |
| 154 | PREVIOUS_APPLICATION_MOST_FREQ_PREVIOUS_APPLICATION_NAME_CASH_LOAN_PURPOSE | 0.01 | 0.06 | 0.00 |
| 155 | HOUR_APPR_PROCESS_START | 0.01 | 0.05 | 0.05 |
| 156 | PREVIOUS_APPLICATION_NAME_CONTRACT_TYPE_CASH LOANS | 0.01 | 0.05 | 0.05 |
| 157 | AMT_INCOME_TOTAL | 0.01 | 0.05 | 0.04 |
| 158 | FLAG_WORK_PHONE | 0.01 | 0.05 | 0.04 |
| 159 | BUREAU_CNT_CREDIT_PROLONG_MEAN | 0.01 | 0.05 | 0.04 |
| 160 | BUREAU_CNT_CREDIT_PROLONG_SUM | 0.01 | 0.05 | 0.04 |
| 161 | PREVIOUS_APPLICATION_NAME_CONTRACT_STATUS_CANCELED | 0.01 | 0.05 | 0.04 |
| 162 | BUREAU_MOST_FREQ_CREDIT_CURRENCY | 0.01 | 0.05 | 0.04 |
| 163 | DEF_60_CNT_SOCIAL_CIRCLE | 0.01 | 0.05 | 0.03 |
| 164 | PREVIOUS_APPLICATION_WEEKDAY_APPR_PROCESS_START_FRIDAY | 0.01 | 0.05 | 0.03 |
| 165 | PREVIOUS_APPLICATION_WEEKDAY_APPR_PROCESS_START_MONDAY | 0.01 | 0.05 | 0.03 |
| 166 | PREVIOUS_APPLICATION_WEEKDAY_APPR_PROCESS_START_THURSDAY | 0.01 | 0.05 | 0.03 |
| 167 | PREVIOUS_APPLICATION_WEEKDAY_APPR_PROCESS_START_TUESDAY | 0.01 | 0.05 | 0.03 |
| 168 | PREVIOUS_APPLICATION_WEEKDAY_APPR_PROCESS_START_WEDNESDAY | 0.01 | 0.05 | 0.03 |
| 169 | PREVIOUS_APPLICATION_NAME_CLIENT_TYPE_REPEATER | 0.01 | 0.05 | 0.03 |
| 170 | PREVIOUS_APPLICATION_CODE_REJECT_REASON_SCO | 0.01 | 0.05 | 0.03 |
| 171 | POS_CASH_CNT_INSTALMENT_SUM | 0.01 | 0.05 | 0.03 |
| 172 | PREVIOUS_APPLICATION_CODE_REJECT_REASON_XAP | 0.01 | 0.05 | 0.02 |
| 173 | POS_CASH_CNT_INSTALMENT_FUTURE_SUM | 0.01 | 0.05 | 0.02 |
| 174 | PREVIOUS_APPLICATION_WEEKDAY_APPR_PROCESS_START_SATURDAY | 0.01 | 0.05 | 0.01 |
| 175 | PREVIOUS_APPLICATION_NAME_PRODUCT_TYPE_XNA | 0.01 | 0.05 | 0.01 |
| 176 | POS_CASH_MONTHS_BALANCE_MAX | 0.01 | 0.05 | 0.01 |
| 177 | PREVIOUS_APPLICATION_NAME_YIELD_GROUP_MIDDLE | 0.01 | 0.05 | 0.00 |
| 178 | CNT_CHILDREN | 0.01 | 0.04 | 0.04 |
| 179 | FLAG_PHONE | 0.01 | 0.04 | 0.04 |
| 180 | FLAG_OWN_CAR | 0.01 | 0.04 | 0.04 |
| 181 | POS_CASH_CNT_INSTALMENT_FUTURE_MIN | 0.01 | 0.04 | 0.03 |
| 182 | CREDIT_CARD_BALANCE_MONTHS_BALANCE_MAX | 0.01 | 0.04 | 0.03 |
| 183 | CREDIT_CARD_BALANCE_AMT_CREDIT_LIMIT_ACTUAL_MIN | 0.01 | 0.04 | 0.03 |
| 184 | CNT_FAM_MEMBERS | 0.01 | 0.04 | 0.02 |
| 185 | PREVIOUS_APPLICATION_NAME_CONTRACT_TYPE_XNA | 0.01 | 0.04 | 0.02 |
| 186 | PREVIOUS_APPLICATION_NAME_CONTRACT_STATUS_UNUSED OFFER | 0.01 | 0.04 | 0.02 |
| 187 | PREVIOUS_APPLICATION_NAME_CLIENT_TYPE_XNA | 0.01 | 0.04 | 0.02 |
| 188 | PREVIOUS_APPLICATION_CODE_REJECT_REASON_CLIENT | 0.01 | 0.04 | 0.02 |
| 189 | PREVIOUS_APPLICATION_CODE_REJECT_REASON_SYSTEM | 0.01 | 0.04 | 0.02 |
| 190 | PREVIOUS_APPLICATION_CODE_REJECT_REASON_VERIF | 0.01 | 0.04 | 0.02 |
| 191 | PREVIOUS_APPLICATION_CODE_REJECT_REASON_XNA | 0.01 | 0.04 | 0.02 |
| 192 | CREDIT_CARD_BALANCE_AMT_DRAWINGS_CURRENT_MIN | 0.01 | 0.04 | 0.02 |
| 193 | CREDIT_CARD_BALANCE_CNT_DRAWINGS_CURRENT_MIN | 0.01 | 0.04 | 0.02 |
| 194 | CREDIT_CARD_BALANCE_SK_DPD_MAX | 0.01 | 0.04 | 0.02 |
| 195 | CREDIT_CARD_BALANCE_SK_DPD_MEAN | 0.01 | 0.04 | 0.02 |
| 196 | CREDIT_CARD_BALANCE_SK_DPD_DEF_MAX | 0.01 | 0.04 | 0.02 |
| 197 | CREDIT_CARD_BALANCE_STATUS_Completed | 0.01 | 0.04 | 0.02 |
| 198 | PREVIOUS_APPLICATION_PREV_APPS_COUNT | 0.01 | 0.04 | 0.01 |
| 199 | PREVIOUS_APPLICATION_WEEKDAY_APPR_PROCESS_START_SUNDAY | 0.01 | 0.04 | 0.01 |
| 200 | PREVIOUS_APPLICATION_NAME_PRODUCT_TYPE_X-SELL | 0.01 | 0.04 | 0.01 |
| 201 | CREDIT_CARD_BALANCE_SK_DPD_DEF_SUM | 0.00 | 0.04 | 0.02 |
| 202 | OBS_30_CNT_SOCIAL_CIRCLE | 0.00 | 0.03 | 0.02 |
| 203 | OBS_60_CNT_SOCIAL_CIRCLE | 0.00 | 0.03 | 0.02 |
| 204 | POS_CASH_NAME_CONTRACT_STATUS_RETURNED TO THE STORE | 0.00 | 0.03 | 0.02 |
| 205 | CREDIT_CARD_BALANCE_AMT_CREDIT_LIMIT_ACTUAL_MEAN | 0.00 | 0.03 | 0.02 |
| 206 | CREDIT_CARD_BALANCE_AMT_CREDIT_LIMIT_ACTUAL_MAX | 0.00 | 0.03 | 0.02 |
| 207 | CREDIT_CARD_BALANCE_AMT_CREDIT_LIMIT_ACTUAL_STD | 0.00 | 0.03 | 0.02 |
| 208 | CREDIT_CARD_BALANCE_AMT_PAYMENT_TOTAL_CURRENT_MIN | 0.00 | 0.03 | 0.02 |
| 209 | CREDIT_CARD_BALANCE_SK_DPD_SUM | 0.00 | 0.03 | 0.02 |
| 210 | CREDIT_CARD_BALANCE_SK_DPD_DEF_MEAN | 0.00 | 0.03 | 0.02 |
| 211 | CREDIT_CARD_BALANCE_STATUS_Approved | 0.00 | 0.03 | 0.02 |
| 212 | CREDIT_CARD_BALANCE_STATUS_Demand | 0.00 | 0.03 | 0.02 |
| 213 | CREDIT_CARD_BALANCE_STATUS_Refused | 0.00 | 0.03 | 0.02 |
| 214 | CREDIT_CARD_BALANCE_STATUS_Sent proposal | 0.00 | 0.03 | 0.02 |
| 215 | CREDIT_CARD_BALANCE_STATUS_Signed | 0.00 | 0.03 | 0.02 |
| 216 | POS_CASH_NAME_CONTRACT_STATUS_AMORTIZED DEBT | 0.00 | 0.03 | 0.01 |
| 217 | POS_CASH_NAME_CONTRACT_STATUS_APPROVED | 0.00 | 0.03 | 0.01 |
| 218 | POS_CASH_NAME_CONTRACT_STATUS_CANCELED | 0.00 | 0.03 | 0.01 |
| 219 | POS_CASH_NAME_CONTRACT_STATUS_DEMAND | 0.00 | 0.03 | 0.01 |
| 220 | POS_CASH_NAME_CONTRACT_STATUS_SIGNED | 0.00 | 0.03 | 0.01 |
| 221 | POS_CASH_NAME_CONTRACT_STATUS_XNA | 0.00 | 0.03 | 0.01 |
| 222 | NAME_TYPE_SUITE | 0.00 | 0.03 | 0.01 |
| 223 | FLAG_DOCUMENT_8 | 0.00 | 0.02 | 0.01 |
| 224 | FLAG_DOCUMENT_13 | 0.00 | 0.02 | 0.00 |
| 225 | FLAG_DOCUMENT_14 | 0.00 | 0.02 | 0.00 |
| 226 | FLAG_DOCUMENT_16 | 0.00 | 0.02 | 0.00 |
| 227 | REG_REGION_NOT_WORK_REGION | 0.00 | 0.01 | 0.01 |
| 228 | FLAG_OWN_REALTY | 0.00 | 0.01 | 0.01 |
| 229 | REG_REGION_NOT_LIVE_REGION | 0.00 | 0.01 | 0.00 |
| 230 | FLAG_DOCUMENT_4 | 0.00 | 0.01 | 0.00 |
| 231 | FLAG_DOCUMENT_9 | 0.00 | 0.01 | 0.00 |
| 232 | FLAG_DOCUMENT_11 | 0.00 | 0.01 | 0.00 |
| 233 | FLAG_DOCUMENT_15 | 0.00 | 0.01 | 0.00 |
| 234 | FLAG_DOCUMENT_17 | 0.00 | 0.01 | 0.00 |
| 235 | FLAG_DOCUMENT_18 | 0.00 | 0.01 | 0.00 |
| 236 | FLAG_DOCUMENT_21 | 0.00 | 0.01 | 0.00 |
| 237 | WEEKDAY_APPR_PROCESS_START | 0.00 | 0.01 | 0.00 |
| 238 | FLAG_MOBIL | 0.00 | 0.00 | 0.00 |
| 239 | FLAG_CONT_MOBILE | 0.00 | 0.00 | 0.00 |
| 240 | FLAG_EMAIL | 0.00 | 0.00 | 0.00 |
| 241 | LIVE_REGION_NOT_WORK_REGION | 0.00 | 0.00 | 0.00 |
| 242 | FLAG_DOCUMENT_2 | 0.00 | 0.00 | 0.00 |
| 243 | FLAG_DOCUMENT_5 | 0.00 | 0.00 | 0.00 |
| 244 | FLAG_DOCUMENT_7 | 0.00 | 0.00 | 0.00 |
| 245 | FLAG_DOCUMENT_10 | 0.00 | 0.00 | 0.00 |
| 246 | FLAG_DOCUMENT_19 | 0.00 | 0.00 | 0.00 |
| 247 | FLAG_DOCUMENT_20 | 0.00 | 0.00 | 0.00 |
Based on the IV value ranges, the best candidate variables for the model appear to be:
- EXT_SOURCE_3
- EXT_SOURCE_2
- BUREAU_DAYS_CREDIT_MEAN
- AMT_GOODS_PRICE
- DAYS_BIRTH
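For reference, Information Value is usually defined per predictor as the sum over groups of (share of goods − share of bads) · ln(share of goods / share of bads). The sketch below illustrates that definition only; the function name `information_value`, the 10 quantile bins, the 0.5 smoothing term, and the coding of the target (1 = default, 0 = repaid) are assumptions and may differ from how `ranking_predyktorow` was actually built.

```python
import numpy as np
import pandas as pd


def information_value(feature: pd.Series, target: pd.Series, bins: int = 10) -> float:
    """Information Value of `feature` against a binary `target` (assumed 1 = bad, 0 = good)."""
    # Numeric features are cut into quantile bins; other features use their own categories.
    if pd.api.types.is_numeric_dtype(feature):
        groups = pd.qcut(feature, q=bins, duplicates="drop")
    else:
        groups = feature
    # Count goods (0) and bads (1) in every group.
    counts = pd.crosstab(groups, target)
    # Assumption: 0.5 is added to each cell so that empty cells do not produce log(0).
    good = counts[0] + 0.5
    bad = counts[1] + 0.5
    good_share = good / good.sum()
    bad_share = bad / bad.sum()
    # Weight of Evidence per group, then IV as the weighted sum.
    woe = np.log(good_share / bad_share)
    return float(((good_share - bad_share) * woe).sum())


# Toy data: one feature that separates the classes perfectly, one that carries no signal.
x = pd.Series(np.arange(1000))
separating_target = (x >= 500).astype(int)
alternating_target = x % 2

iv_strong = information_value(x, separating_target)
iv_none = information_value(x, alternating_target)
```

Every term of the sum is non-negative, so IV is never below zero, and a predictor whose groups all have the same good/bad mix scores exactly zero.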
As a first step, I drop every variable that is weak on all three measures at once, i.e. one that meets all of the following:
- IV < 0.05 (very weak variables fall below 0.02, but 0.05 still indicates weak predictive power; there are enough variables that I can afford to keep only those above 0.05);
- Cramér's V < 0.05;
- Gini < 0.05
# Define the conditions
condition = (ranking_predyktorow['IV'] < 0.05) & (ranking_predyktorow['V Cramera'] < 0.05) & (ranking_predyktorow['Gini'] < 0.05)
# Filter the DataFrame to keep the rows that meet all three conditions
filtered_rows = ranking_predyktorow[condition]
filtered_rows.head(5)
| | Zmienna | IV | V Cramera | Gini |
|---|---|---|---|---|
| 178 | CNT_CHILDREN | 0.01 | 0.04 | 0.04 |
| 179 | FLAG_PHONE | 0.01 | 0.04 | 0.04 |
| 180 | FLAG_OWN_CAR | 0.01 | 0.04 | 0.04 |
| 181 | POS_CASH_CNT_INSTALMENT_FUTURE_MIN | 0.01 | 0.04 | 0.03 |
| 182 | CREDIT_CARD_BALANCE_MONTHS_BALANCE_MAX | 0.01 | 0.04 | 0.03 |
filtered_rows.shape
(70, 4)
var_to_drop = filtered_rows['Zmienna'].tolist()
df_train.drop(columns=var_to_drop, inplace=True)
df_test.drop(columns=var_to_drop, inplace=True)
df_all.drop(columns=var_to_drop, inplace=True)
df_train.shape  # 181 columns remain; 70 were dropped, as expected
(452394, 181)
df_test.shape
(61503, 181)
df_all.shape  # df_all always has one more column: Próba, which distinguishes the training set from the test set
(513897, 182)
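The same list of columns has to be dropped from all three frames, and the shapes are then checked by eye. A minimal sketch of a helper that does both in one place is shown below; the name `drop_everywhere` and the toy frames are hypothetical and only stand in for `df_train`, `df_test` and `df_all`.

```python
import pandas as pd


def drop_everywhere(frames, columns):
    """Drop `columns` in place from every DataFrame in `frames`."""
    for frame in frames:
        # drop() raises a KeyError if a column is missing, which surfaces frames that are out of sync.
        frame.drop(columns=columns, inplace=True)
        # Sanity check: none of the dropped columns may remain.
        assert not set(columns) & set(frame.columns)


# Toy frames standing in for df_train, df_test and df_all.
train = pd.DataFrame({"A": [1], "B": [2], "C": [3]})
test = pd.DataFrame({"A": [4], "B": [5], "C": [6]})
both = pd.DataFrame({"A": [1, 4], "B": [2, 5], "C": [3, 6], "Próba": ["train", "test"]})

drop_everywhere([train, test, both], ["B", "C"])
```

After the call, the combined frame still has exactly one column more than the training frame, which mirrors the relationship between `df_all` and `df_train` noted above.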
Detailed variable reduction based on the predictor ranking and the IV, Cramér's V, and Gini measures¶
# Ranking of the remaining variables, with each variable's data type attached
rest_predictors = ranking_predyktorow[~ranking_predyktorow['Zmienna'].isin(filtered_rows['Zmienna'])]
predictors = pd.merge(rest_predictors,results_df[['Variable','Data Type']], how='inner', left_on='Zmienna', right_on='Variable')
predictors[['Zmienna', 'Data Type', 'IV', 'V Cramera', 'Gini']]
| | Zmienna | Data Type | IV | V Cramera | Gini |
|---|---|---|---|---|---|
| 0 | EXT_SOURCE_3 | float64 | 0.33 | 0.28 | 0.25 |
| 1 | EXT_SOURCE_2 | float64 | 0.32 | 0.27 | 0.31 |
| 2 | BUREAU_DAYS_CREDIT_MEAN | float64 | 0.12 | 0.17 | 0.10 |
| 3 | DAYS_BIRTH | int64 | 0.09 | 0.15 | 0.17 |
| 4 | AMT_GOODS_PRICE | float64 | 0.09 | 0.15 | 0.07 |
| 5 | OCCUPATION_TYPE | object | 0.09 | 0.15 | 0.07 |
| 6 | DAYS_EMPLOYED | int64 | 0.08 | 0.14 | 0.07 |
| 7 | BUREAU_DAYS_CREDIT_MAX | float64 | 0.08 | 0.14 | 0.07 |
| 8 | BUREAU_DAYS_CREDIT_MIN | float64 | 0.08 | 0.14 | 0.07 |
| 9 | ORGANIZATION_TYPE | object | 0.08 | 0.14 | 0.05 |
| 10 | BUREAU_MOST_FREQ_CREDIT_ACTIVE | object | 0.07 | 0.13 | 0.13 |
| 11 | BUREAU_DAYS_CREDIT_ENDDATE_MEAN | float64 | 0.07 | 0.13 | 0.06 |
| 12 | INSTALLMENTS_PAYMENTS_IS_DELAYED_MEAN | float64 | 0.06 | 0.12 | 0.13 |
| 13 | PREVIOUS_APPLICATION_NAME_CONTRACT_STATUS_REFUSED | float64 | 0.06 | 0.12 | 0.12 |
| 14 | PREVIOUS_APPLICATION_NAME_PRODUCT_TYPE_WALK-IN | float64 | 0.06 | 0.12 | 0.11 |
| 15 | NAME_INCOME_TYPE | object | 0.06 | 0.12 | 0.10 |
| 16 | BUREAU_DAYS_CREDIT_UPDATE_MIN | float64 | 0.06 | 0.12 | 0.06 |
| 17 | BUREAU_DAYS_CREDIT_ENDDATE_MIN | float64 | 0.06 | 0.12 | 0.05 |
| 18 | CREDIT_CARD_BALANCE_CNT_DRAWINGS_CURRENT_MEAN | float64 | 0.06 | 0.12 | 0.04 |
| 19 | CREDIT_CARD_BALANCE_CREDIT_USE_RATE_MEAN | float64 | 0.06 | 0.12 | 0.04 |
| 20 | CREDIT_CARD_BALANCE_CNT_INSTALMENT_MATURE_CUM_SUM | float64 | 0.06 | 0.12 | 0.02 |
| 21 | BUREAU_AMT_CREDIT_SUM_DEBT_MEAN | float64 | 0.05 | 0.12 | 0.04 |
| 22 | DAYS_LAST_PHONE_CHANGE | float64 | 0.05 | 0.11 | 0.12 |
| 23 | PREVIOUS_APPLICATION_DAYS_DECISION_MEAN | float64 | 0.05 | 0.11 | 0.12 |
| 24 | INSTALLMENTS_PAYMENTS_PAYMENT_UNDERPAYMENT_MEAN | float64 | 0.05 | 0.11 | 0.11 |
| 25 | REGION_RATING_CLIENT_W_CITY | int64 | 0.05 | 0.11 | 0.10 |
| 26 | REGION_RATING_CLIENT | int64 | 0.05 | 0.11 | 0.09 |
| 27 | NAME_EDUCATION_TYPE | object | 0.05 | 0.11 | 0.08 |
| 28 | BUREAU_DAYS_CREDIT_UPDATE_MAX | float64 | 0.05 | 0.11 | 0.04 |
| 29 | CREDIT_CARD_BALANCE_AMT_DRAWINGS_CURRENT_MEAN | float64 | 0.05 | 0.11 | 0.04 |
| 30 | CREDIT_CARD_BALANCE_CNT_DRAWINGS_CURRENT_MAX | float64 | 0.05 | 0.11 | 0.04 |
| 31 | BUREAU_AMT_CREDIT_SUM_DEBT_SUM | float64 | 0.05 | 0.11 | 0.03 |
| 32 | CREDIT_CARD_BALANCE_CNT_INSTALMENT_MATURE_CUM_MEAN | float64 | 0.05 | 0.11 | 0.02 |
| 33 | PREVIOUS_APPLICATION_MOST_FREQ_PREVIOUS_APPLICATION_PRODUCT_COMBINATION | object | 0.05 | 0.11 | 0.02 |
| 34 | PREVIOUS_APPLICATION_DAYS_DECISION_MIN | float64 | 0.04 | 0.10 | 0.12 |
| 35 | POS_CASH_MONTHS_BALANCE_MIN | float64 | 0.04 | 0.10 | 0.12 |
| 36 | DAYS_ID_PUBLISH | int64 | 0.04 | 0.10 | 0.11 |
| 37 | INSTALLMENTS_PAYMENTS_IS_DELAYED_SUM | float64 | 0.04 | 0.10 | 0.10 |
| 38 | INSTALLMENTS_PAYMENTS_PAYMENT_UNDERPAYMENT_SUM | float64 | 0.04 | 0.10 | 0.10 |
| 39 | INSTALLMENTS_PAYMENTS_PERCENTAGE_DELAYED | float64 | 0.04 | 0.10 | 0.10 |
| 40 | CODE_GENDER | object | 0.04 | 0.10 | 0.10 |
| 41 | PREVIOUS_APPLICATION_NAME_CLIENT_TYPE_NEW | float64 | 0.04 | 0.10 | 0.09 |
| 42 | PREVIOUS_APPLICATION_RATIO_APP_TO_GET_MAX | float64 | 0.04 | 0.10 | 0.08 |
| 43 | AMT_CREDIT | float64 | 0.04 | 0.10 | 0.04 |
| 44 | CREDIT_CARD_BALANCE_AMT_BALANCE_MEAN | float64 | 0.04 | 0.10 | 0.04 |
| 45 | CREDIT_CARD_BALANCE_AMT_DRAWINGS_CURRENT_STD | float64 | 0.04 | 0.10 | 0.04 |
| 46 | CREDIT_CARD_BALANCE_AMT_TOTAL_RECEIVABLE_MEAN | float64 | 0.04 | 0.10 | 0.04 |
| 47 | CREDIT_CARD_BALANCE_CNT_DRAWINGS_CURRENT_SUM | float64 | 0.04 | 0.10 | 0.04 |
| 48 | CREDIT_CARD_BALANCE_CREDIT_USE_RATE_MAX | float64 | 0.04 | 0.10 | 0.04 |
| 49 | CREDIT_CARD_BALANCE_AMT_PAYMENT_TOTAL_CURRENT_SUM | float64 | 0.04 | 0.10 | 0.03 |
| 50 | CREDIT_CARD_BALANCE_CNT_INSTALMENT_MATURE_CUM_MAX | float64 | 0.04 | 0.10 | 0.03 |
| 51 | BUREAU_DAYS_CREDIT_ENDDATE_MAX | float64 | 0.04 | 0.10 | 0.02 |
| 52 | BUREAU_AMT_CREDIT_SUM_LIMIT_SUM | float64 | 0.04 | 0.10 | 0.01 |
| 53 | PREVIOUS_APPLICATION_CODE_REJECT_REASON_HC | float64 | 0.04 | 0.09 | 0.09 |
| 54 | PREVIOUS_APPLICATION_CODE_REJECT_REASON_SCOFR | float64 | 0.04 | 0.09 | 0.05 |
| 55 | POS_CASH_MONTHS_BALANCE_MEAN | float64 | 0.03 | 0.09 | 0.09 |
| 56 | INSTALLMENTS_PAYMENTS_PAYMENT_UNDERPAYMENT_MAX | float64 | 0.03 | 0.09 | 0.09 |
| 57 | REG_CITY_NOT_WORK_CITY | int64 | 0.03 | 0.09 | 0.08 |
| 58 | PREVIOUS_APPLICATION_NAME_CONTRACT_TYPE_REVOLVING LOANS | float64 | 0.03 | 0.09 | 0.08 |
| 59 | FLAG_EMP_PHONE | int64 | 0.03 | 0.09 | 0.07 |
| 60 | POS_CASH_APP_COUNT | float64 | 0.03 | 0.09 | 0.07 |
| 61 | PREVIOUS_APPLICATION_AMT_ANNUITY_MIN | float64 | 0.03 | 0.09 | 0.06 |
| 62 | PREVIOUS_APPLICATION_RATIO_APP_TO_GET_MEAN | float64 | 0.03 | 0.09 | 0.06 |
| 63 | PREVIOUS_APPLICATION_CODE_REJECT_REASON_LIMIT | float64 | 0.03 | 0.09 | 0.06 |
| 64 | CREDIT_CARD_BALANCE_AMT_BALANCE_MAX | float64 | 0.03 | 0.09 | 0.04 |
| 65 | CREDIT_CARD_BALANCE_AMT_BALANCE_STD | float64 | 0.03 | 0.09 | 0.04 |
| 66 | CREDIT_CARD_BALANCE_AMT_DRAWINGS_CURRENT_MAX | float64 | 0.03 | 0.09 | 0.04 |
| 67 | CREDIT_CARD_BALANCE_AMT_TOTAL_RECEIVABLE_MAX | float64 | 0.03 | 0.09 | 0.04 |
| 68 | CREDIT_CARD_BALANCE_AMT_BALANCE_SUM | float64 | 0.03 | 0.09 | 0.03 |
| 69 | CREDIT_CARD_BALANCE_AMT_BALANCE_MIN | float64 | 0.03 | 0.09 | 0.03 |
| 70 | CREDIT_CARD_BALANCE_AMT_DRAWINGS_CURRENT_SUM | float64 | 0.03 | 0.09 | 0.03 |
| 71 | CREDIT_CARD_BALANCE_AMT_PAYMENT_TOTAL_CURRENT_MAX | float64 | 0.03 | 0.09 | 0.03 |
| 72 | CREDIT_CARD_BALANCE_AMT_TOTAL_RECEIVABLE_SUM | float64 | 0.03 | 0.09 | 0.03 |
| 73 | CREDIT_CARD_BALANCE_AMT_TOTAL_RECEIVABLE_MIN | float64 | 0.03 | 0.09 | 0.03 |
| 74 | PREVIOUS_APPLICATION_NAME_YIELD_GROUP_HIGH | float64 | 0.03 | 0.08 | 0.09 |
| 75 | DAYS_REGISTRATION | float64 | 0.03 | 0.08 | 0.08 |
| 76 | PREVIOUS_APPLICATION_NAME_YIELD_GROUP_XNA | float64 | 0.03 | 0.08 | 0.08 |
| 77 | FLAG_DOCUMENT_3 | int64 | 0.03 | 0.08 | 0.07 |
| 78 | POS_CASH_NAME_CONTRACT_STATUS_ACTIVE | float64 | 0.03 | 0.08 | 0.07 |
| 79 | REGION_POPULATION_RELATIVE | float64 | 0.03 | 0.08 | 0.06 |
| 80 | POS_CASH_SK_DPD_DEF_MEAN | float64 | 0.03 | 0.08 | 0.06 |
| 81 | PREVIOUS_APPLICATION_NAME_YIELD_GROUP_LOW_NORMAL | float64 | 0.03 | 0.08 | 0.05 |
| 82 | BUREAU_MOST_FREQ_CREDIT_TYPE | object | 0.03 | 0.08 | 0.05 |
| 83 | CREDIT_CARD_BALANCE_AMT_TOTAL_RECEIVABLE_STD | float64 | 0.03 | 0.08 | 0.04 |
| 84 | CREDIT_CARD_BALANCE_AMT_PAYMENT_TOTAL_CURRENT_MEAN | float64 | 0.03 | 0.08 | 0.03 |
| 85 | AMT_ANNUITY | float64 | 0.03 | 0.08 | 0.00 |
| 86 | BUREAU_AMT_CREDIT_SUM_MEAN | float64 | 0.02 | 0.08 | 0.07 |
| 87 | POS_CASH_SK_DPD_DEF_MAX | float64 | 0.02 | 0.08 | 0.06 |
| 88 | REG_CITY_NOT_LIVE_CITY | int64 | 0.02 | 0.08 | 0.05 |
| 89 | PREVIOUS_APPLICATION_HOUR_APPR_PROCESS_START_MEAN | float64 | 0.02 | 0.08 | 0.05 |
| 90 | PREVIOUS_APPLICATION_HOUR_APPR_PROCESS_START_MIN | float64 | 0.02 | 0.08 | 0.05 |
| 91 | PREVIOUS_APPLICATION_RATIO_APP_TO_GET_MIN | float64 | 0.02 | 0.08 | 0.03 |
| 92 | INSTALLMENTS_PAYMENTS_PAYMENT_CHANGE_MAX | float64 | 0.02 | 0.08 | 0.02 |
| 93 | BUREAU_AMT_CREDIT_SUM_SUM | float64 | 0.02 | 0.07 | 0.07 |
| 94 | POS_CASH_SK_DPD_MAX | float64 | 0.02 | 0.07 | 0.06 |
| 95 | POS_CASH_SK_DPD_MEAN | float64 | 0.02 | 0.07 | 0.06 |
| 96 | BUREAU_BALANCE_STATUS_C | float64 | 0.02 | 0.07 | 0.05 |
| 97 | POS_CASH_CNT_INSTALMENT_FUTURE_MEAN | float64 | 0.02 | 0.07 | 0.05 |
| 98 | POS_CASH_NAME_CONTRACT_STATUS_COMPLETED | float64 | 0.02 | 0.07 | 0.05 |
| 99 | PREVIOUS_APPLICATION_AMT_ANNUITY_MEAN | float64 | 0.02 | 0.07 | 0.04 |
| 100 | PREVIOUS_APPLICATION_AMT_GOODS_PRICE_MIN | float64 | 0.02 | 0.07 | 0.04 |
| 101 | PREVIOUS_APPLICATION_HOUR_APPR_PROCESS_START_MAX | float64 | 0.02 | 0.07 | 0.04 |
| 102 | PREVIOUS_APPLICATION_NAME_CONTRACT_STATUS_APPROVED | float64 | 0.02 | 0.07 | 0.04 |
| 103 | NAME_FAMILY_STATUS | object | 0.02 | 0.07 | 0.04 |
| 104 | BUREAU_CREDIT_DAY_OVERDUE_MEAN | float64 | 0.02 | 0.07 | 0.03 |
| 105 | BUREAU_CREDIT_DAY_OVERDUE_SUM | float64 | 0.02 | 0.07 | 0.03 |
| 106 | BUREAU_CREDIT_DAY_OVERDUE_MAX | float64 | 0.02 | 0.07 | 0.03 |
| 107 | BUREAU_AMT_CREDIT_SUM_OVERDUE_MEAN | float64 | 0.02 | 0.07 | 0.03 |
| 108 | BUREAU_AMT_CREDIT_SUM_OVERDUE_SUM | float64 | 0.02 | 0.07 | 0.03 |
| 109 | PREVIOUS_APPLICATION_AMT_APPLICATION_MEAN | float64 | 0.02 | 0.07 | 0.03 |
| 110 | PREVIOUS_APPLICATION_NAME_CLIENT_TYPE_REFRESHED | float64 | 0.02 | 0.07 | 0.03 |
| 111 | PREVIOUS_APPLICATION_NAME_YIELD_GROUP_LOW_ACTION | float64 | 0.02 | 0.07 | 0.03 |
| 112 | CREDIT_CARD_BALANCE_MONTHS_BALANCE_MIN | float64 | 0.02 | 0.07 | 0.03 |
| 113 | CREDIT_CARD_BALANCE_MONTHS_BALANCE_MEAN | float64 | 0.02 | 0.07 | 0.03 |
| 114 | AMT_REQ_CREDIT_BUREAU_YEAR | float64 | 0.02 | 0.07 | 0.01 |
| 115 | BUREAU_BALANCE_STATUS_1 | float64 | 0.02 | 0.07 | 0.01 |
| 116 | PREVIOUS_APPLICATION_AMT_APPLICATION_MAX | float64 | 0.02 | 0.07 | 0.01 |
| 117 | PREVIOUS_APPLICATION_AMT_GOODS_PRICE_MAX | float64 | 0.02 | 0.07 | 0.01 |
| 118 | CREDIT_CARD_BALANCE_STATUS_Active | float64 | 0.02 | 0.07 | 0.01 |
| 119 | PREVIOUS_APPLICATION_AMT_CREDIT_MAX | float64 | 0.02 | 0.07 | 0.00 |
| 120 | PREVIOUS_APPLICATION_DAYS_DECISION_SUM | float64 | 0.02 | 0.06 | 0.07 |
| 121 | POS_CASH_CNT_INSTALMENT_MEAN | float64 | 0.02 | 0.06 | 0.04 |
| 122 | INSTALLMENTS_PAYMENTS_PAYMENT_CHANGE_MEAN | float64 | 0.02 | 0.06 | 0.04 |
| 123 | BUREAU_BALANCE_STATUS_0 | float64 | 0.02 | 0.06 | 0.03 |
| 124 | BUREAU_BALANCE_STATUS_X | float64 | 0.02 | 0.06 | 0.03 |
| 125 | PREVIOUS_APPLICATION_AMT_ANNUITY_MAX | float64 | 0.02 | 0.06 | 0.03 |
| 126 | PREVIOUS_APPLICATION_AMT_CREDIT_MIN | float64 | 0.02 | 0.06 | 0.03 |
| 127 | PREVIOUS_APPLICATION_AMT_APPLICATION_MIN | float64 | 0.02 | 0.06 | 0.02 |
| 128 | PREVIOUS_APPLICATION_AMT_GOODS_PRICE_MEAN | float64 | 0.02 | 0.06 | 0.02 |
| 129 | PREVIOUS_APPLICATION_DAYS_DECISION_MAX | float64 | 0.01 | 0.06 | 0.06 |
| 130 | LIVE_CITY_NOT_WORK_CITY | int64 | 0.01 | 0.06 | 0.05 |
| 131 | POS_CASH_CNT_INSTALMENT_MIN | float64 | 0.01 | 0.06 | 0.05 |
| 132 | DEF_30_CNT_SOCIAL_CIRCLE | float64 | 0.01 | 0.06 | 0.04 |
| 133 | BUREAU_BALANCE_STATUS_3 | float64 | 0.01 | 0.06 | 0.04 |
| 134 | BUREAU_BALANCE_STATUS_4 | float64 | 0.01 | 0.06 | 0.04 |
| 135 | BUREAU_BALANCE_STATUS_5 | float64 | 0.01 | 0.06 | 0.04 |
| 136 | FLAG_DOCUMENT_6 | int64 | 0.01 | 0.06 | 0.03 |
| 137 | BUREAU_BALANCE_STATUS_2 | float64 | 0.01 | 0.06 | 0.03 |
| 138 | PREVIOUS_APPLICATION_NAME_CONTRACT_TYPE_CONSUMER LOANS | float64 | 0.01 | 0.06 | 0.03 |
| 139 | NAME_CONTRACT_TYPE | object | 0.01 | 0.06 | 0.03 |
| 140 | NAME_HOUSING_TYPE | object | 0.01 | 0.06 | 0.03 |
| 141 | PREVIOUS_APPLICATION_AMT_ANNUITY_SUM | float64 | 0.01 | 0.06 | 0.02 |
| 142 | PREVIOUS_APPLICATION_AMT_CREDIT_MEAN | float64 | 0.01 | 0.06 | 0.02 |
| 143 | POS_CASH_CNT_INSTALMENT_MAX | float64 | 0.01 | 0.06 | 0.02 |
| 144 | POS_CASH_CNT_INSTALMENT_FUTURE_MAX | float64 | 0.01 | 0.06 | 0.02 |
| 145 | INSTALLMENTS_PAYMENTS_NUM_INSTALMENT_VERSION_NUNIQUE | float64 | 0.01 | 0.06 | 0.02 |
| 146 | CREDIT_CARD_BALANCE_CNT_INSTALMENT_MATURE_CUM_MIN | float64 | 0.01 | 0.06 | 0.02 |
| 147 | PREVIOUS_APPLICATION_AMT_CREDIT_SUM | float64 | 0.01 | 0.06 | 0.01 |
| 148 | CREDIT_CARD_BALANCE_AMT_CREDIT_LIMIT_ACTUAL_SUM | float64 | 0.01 | 0.06 | 0.01 |
| 149 | PREVIOUS_APPLICATION_AMT_APPLICATION_SUM | float64 | 0.01 | 0.06 | 0.00 |
| 150 | PREVIOUS_APPLICATION_AMT_GOODS_PRICE_SUM | float64 | 0.01 | 0.06 | 0.00 |
| 151 | HOUR_APPR_PROCESS_START | int64 | 0.01 | 0.05 | 0.05 |
| 152 | PREVIOUS_APPLICATION_NAME_CONTRACT_TYPE_CASH LOANS | float64 | 0.01 | 0.05 | 0.05 |
| 153 | AMT_INCOME_TOTAL | float64 | 0.01 | 0.05 | 0.04 |
| 154 | FLAG_WORK_PHONE | int64 | 0.01 | 0.05 | 0.04 |
| 155 | BUREAU_CNT_CREDIT_PROLONG_MEAN | float64 | 0.01 | 0.05 | 0.04 |
| 156 | BUREAU_CNT_CREDIT_PROLONG_SUM | float64 | 0.01 | 0.05 | 0.04 |
| 157 | PREVIOUS_APPLICATION_NAME_CONTRACT_STATUS_CANCELED | float64 | 0.01 | 0.05 | 0.04 |
| 158 | BUREAU_MOST_FREQ_CREDIT_CURRENCY | object | 0.01 | 0.05 | 0.04 |
| 159 | DEF_60_CNT_SOCIAL_CIRCLE | float64 | 0.01 | 0.05 | 0.03 |
| 160 | PREVIOUS_APPLICATION_WEEKDAY_APPR_PROCESS_START_FRIDAY | float64 | 0.01 | 0.05 | 0.03 |
| 161 | PREVIOUS_APPLICATION_WEEKDAY_APPR_PROCESS_START_MONDAY | float64 | 0.01 | 0.05 | 0.03 |
| 162 | PREVIOUS_APPLICATION_WEEKDAY_APPR_PROCESS_START_THURSDAY | float64 | 0.01 | 0.05 | 0.03 |
| 163 | PREVIOUS_APPLICATION_WEEKDAY_APPR_PROCESS_START_TUESDAY | float64 | 0.01 | 0.05 | 0.03 |
| 164 | PREVIOUS_APPLICATION_WEEKDAY_APPR_PROCESS_START_WEDNESDAY | float64 | 0.01 | 0.05 | 0.03 |
| 165 | PREVIOUS_APPLICATION_NAME_CLIENT_TYPE_REPEATER | float64 | 0.01 | 0.05 | 0.03 |
| 166 | PREVIOUS_APPLICATION_CODE_REJECT_REASON_SCO | float64 | 0.01 | 0.05 | 0.03 |
| 167 | POS_CASH_CNT_INSTALMENT_SUM | float64 | 0.01 | 0.05 | 0.03 |
| 168 | PREVIOUS_APPLICATION_CODE_REJECT_REASON_XAP | float64 | 0.01 | 0.05 | 0.02 |
| 169 | POS_CASH_CNT_INSTALMENT_FUTURE_SUM | float64 | 0.01 | 0.05 | 0.02 |
| 170 | PREVIOUS_APPLICATION_WEEKDAY_APPR_PROCESS_START_SATURDAY | float64 | 0.01 | 0.05 | 0.01 |
| 171 | PREVIOUS_APPLICATION_NAME_PRODUCT_TYPE_XNA | float64 | 0.01 | 0.05 | 0.01 |
| 172 | POS_CASH_MONTHS_BALANCE_MAX | float64 | 0.01 | 0.05 | 0.01 |
| 173 | PREVIOUS_APPLICATION_NAME_YIELD_GROUP_MIDDLE | float64 | 0.01 | 0.05 | 0.00 |
Examining Cramér's V¶
# Cramér's V applies only to categorical variables, so it is examined in detail for those variables
category_cols = predictors[predictors['Data Type'] == 'object']
category_cols[['Zmienna', 'Data Type', 'IV', 'V Cramera', 'Gini']]
| | Zmienna | Data Type | IV | V Cramera | Gini |
|---|---|---|---|---|---|
| 5 | OCCUPATION_TYPE | object | 0.09 | 0.15 | 0.07 |
| 9 | ORGANIZATION_TYPE | object | 0.08 | 0.14 | 0.05 |
| 10 | BUREAU_MOST_FREQ_CREDIT_ACTIVE | object | 0.07 | 0.13 | 0.13 |
| 15 | NAME_INCOME_TYPE | object | 0.06 | 0.12 | 0.10 |
| 27 | NAME_EDUCATION_TYPE | object | 0.05 | 0.11 | 0.08 |
| 33 | PREVIOUS_APPLICATION_MOST_FREQ_PREVIOUS_APPLICATION_PRODUCT_COMBINATION | object | 0.05 | 0.11 | 0.02 |
| 40 | CODE_GENDER | object | 0.04 | 0.10 | 0.10 |
| 82 | BUREAU_MOST_FREQ_CREDIT_TYPE | object | 0.03 | 0.08 | 0.05 |
| 103 | NAME_FAMILY_STATUS | object | 0.02 | 0.07 | 0.04 |
| 139 | NAME_CONTRACT_TYPE | object | 0.01 | 0.06 | 0.03 |
| 140 | NAME_HOUSING_TYPE | object | 0.01 | 0.06 | 0.03 |
| 158 | BUREAU_MOST_FREQ_CREDIT_CURRENCY | object | 0.01 | 0.05 | 0.04 |
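As a reminder of what the `V Cramera` column measures, Cramér's V is the chi-square statistic of the predictor-by-target contingency table, normalised as sqrt(chi2 / (n · (min(rows, columns) − 1))). The sketch below follows that textbook definition without a continuity correction; the function name `cramers_v` is hypothetical, and the notebook's own computation may differ in detail.

```python
import numpy as np
import pandas as pd


def cramers_v(feature: pd.Series, target: pd.Series) -> float:
    """Cramér's V between two categorical series (0 = no association, 1 = perfect)."""
    observed = pd.crosstab(feature, target).to_numpy(dtype=float)
    n = observed.sum()
    min_dim = min(observed.shape) - 1
    # A table with a single row or column carries no association information.
    if min_dim == 0:
        return 0.0
    # Expected counts under independence: row total * column total / n.
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / n
    chi2 = ((observed - expected) ** 2 / expected).sum()
    # Normalise by the sample size and by the smaller table dimension minus one.
    return float(np.sqrt(chi2 / (n * min_dim)))


# Toy data: the first target is fully determined by the category, the second is independent of it.
category = pd.Series(["a", "a", "b", "b"])
v_perfect = cramers_v(category, pd.Series([0, 0, 1, 1]))
v_independent = cramers_v(category, pd.Series([0, 1, 0, 1]))
```

Because the statistic is built from a contingency table, a numeric predictor would first have to be binned, which is why the check in this section is limited to the `object` columns.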
# Define the conditions
condition = (category_cols['IV'] <= 0.05) & (category_cols['V Cramera'] <= 0.05)
# Filter the DataFrame to keep the rows that meet both conditions
filtered_rows = category_cols[condition]
filtered_rows[['Zmienna', 'Data Type', 'IV', 'V Cramera', 'Gini']]
| | Zmienna | Data Type | IV | V Cramera | Gini |
|---|---|---|---|---|---|
| 158 | BUREAU_MOST_FREQ_CREDIT_CURRENCY | object | 0.01 | 0.05 | 0.04 |
var_to_drop = filtered_rows['Zmienna'].tolist()
var_to_drop
['BUREAU_MOST_FREQ_CREDIT_CURRENCY']
df_train.drop(columns=var_to_drop, inplace=True)
df_test.drop(columns=var_to_drop, inplace=True)
df_all.drop(columns=var_to_drop, inplace=True)
df_train.shape  # dropped 1 categorical variable with very weak IV and Cramér's V (Gini was not considered)
(452394, 180)
df_test.shape
(61503, 180)
df_all.shape
(513897, 181)
Examining the Gini coefficient¶
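For a single numeric predictor, the Gini coefficient is commonly defined as 2 · AUC − 1, where AUC is the area under the ROC curve obtained by ranking observations on that predictor alone. The sketch below computes the AUC from ranks (the Mann-Whitney formulation); the function name `gini_coefficient` and the use of the absolute value are assumptions, made because every Gini value in the ranking is non-negative.

```python
import pandas as pd


def gini_coefficient(feature: pd.Series, target: pd.Series) -> float:
    """Gini = |2 * AUC - 1| for a numeric `feature` and a binary `target`."""
    data = pd.DataFrame({"x": feature, "y": target}).dropna()
    # Average ranks handle ties; the AUC then follows from the Mann-Whitney U statistic.
    ranks = data["x"].rank(method="average")
    n_pos = int((data["y"] == 1).sum())
    n_neg = int((data["y"] == 0).sum())
    rank_sum_pos = ranks[data["y"] == 1].sum()
    auc = (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
    # Assumption: the absolute value makes the measure independent of the feature's direction.
    return float(abs(2 * auc - 1))


# Toy data: a feature that orders the classes perfectly and a constant feature.
labels = pd.Series([0, 0, 1, 1])
gini_perfect = gini_coefficient(pd.Series([1, 2, 3, 4]), labels)
gini_constant = gini_coefficient(pd.Series([1, 1, 1, 1]), labels)
```

The measure depends only on how the predictor orders the observations, which is why it suits quantitative variables better than unordered categories.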
# The Gini coefficient is best suited to quantitative variables
numeric_cols = predictors[(predictors['Data Type'] == 'float64') | (predictors['Data Type'] == 'int64')]
numeric_cols[['Zmienna', 'Data Type', 'IV', 'V Cramera', 'Gini']]
| | Zmienna | Data Type | IV | V Cramera | Gini |
|---|---|---|---|---|---|
| 0 | EXT_SOURCE_3 | float64 | 0.33 | 0.28 | 0.25 |
| 1 | EXT_SOURCE_2 | float64 | 0.32 | 0.27 | 0.31 |
| 2 | BUREAU_DAYS_CREDIT_MEAN | float64 | 0.12 | 0.17 | 0.10 |
| 3 | DAYS_BIRTH | int64 | 0.09 | 0.15 | 0.17 |
| 4 | AMT_GOODS_PRICE | float64 | 0.09 | 0.15 | 0.07 |
| 6 | DAYS_EMPLOYED | int64 | 0.08 | 0.14 | 0.07 |
| 7 | BUREAU_DAYS_CREDIT_MAX | float64 | 0.08 | 0.14 | 0.07 |
| 8 | BUREAU_DAYS_CREDIT_MIN | float64 | 0.08 | 0.14 | 0.07 |
| 11 | BUREAU_DAYS_CREDIT_ENDDATE_MEAN | float64 | 0.07 | 0.13 | 0.06 |
| 12 | INSTALLMENTS_PAYMENTS_IS_DELAYED_MEAN | float64 | 0.06 | 0.12 | 0.13 |
| 13 | PREVIOUS_APPLICATION_NAME_CONTRACT_STATUS_REFUSED | float64 | 0.06 | 0.12 | 0.12 |
| 14 | PREVIOUS_APPLICATION_NAME_PRODUCT_TYPE_WALK-IN | float64 | 0.06 | 0.12 | 0.11 |
| 16 | BUREAU_DAYS_CREDIT_UPDATE_MIN | float64 | 0.06 | 0.12 | 0.06 |
| 17 | BUREAU_DAYS_CREDIT_ENDDATE_MIN | float64 | 0.06 | 0.12 | 0.05 |
| 18 | CREDIT_CARD_BALANCE_CNT_DRAWINGS_CURRENT_MEAN | float64 | 0.06 | 0.12 | 0.04 |
| 19 | CREDIT_CARD_BALANCE_CREDIT_USE_RATE_MEAN | float64 | 0.06 | 0.12 | 0.04 |
| 20 | CREDIT_CARD_BALANCE_CNT_INSTALMENT_MATURE_CUM_SUM | float64 | 0.06 | 0.12 | 0.02 |
| 21 | BUREAU_AMT_CREDIT_SUM_DEBT_MEAN | float64 | 0.05 | 0.12 | 0.04 |
| 22 | DAYS_LAST_PHONE_CHANGE | float64 | 0.05 | 0.11 | 0.12 |
| 23 | PREVIOUS_APPLICATION_DAYS_DECISION_MEAN | float64 | 0.05 | 0.11 | 0.12 |
| 24 | INSTALLMENTS_PAYMENTS_PAYMENT_UNDERPAYMENT_MEAN | float64 | 0.05 | 0.11 | 0.11 |
| 25 | REGION_RATING_CLIENT_W_CITY | int64 | 0.05 | 0.11 | 0.10 |
| 26 | REGION_RATING_CLIENT | int64 | 0.05 | 0.11 | 0.09 |
| 28 | BUREAU_DAYS_CREDIT_UPDATE_MAX | float64 | 0.05 | 0.11 | 0.04 |
| 29 | CREDIT_CARD_BALANCE_AMT_DRAWINGS_CURRENT_MEAN | float64 | 0.05 | 0.11 | 0.04 |
| 30 | CREDIT_CARD_BALANCE_CNT_DRAWINGS_CURRENT_MAX | float64 | 0.05 | 0.11 | 0.04 |
| 31 | BUREAU_AMT_CREDIT_SUM_DEBT_SUM | float64 | 0.05 | 0.11 | 0.03 |
| 32 | CREDIT_CARD_BALANCE_CNT_INSTALMENT_MATURE_CUM_MEAN | float64 | 0.05 | 0.11 | 0.02 |
| 34 | PREVIOUS_APPLICATION_DAYS_DECISION_MIN | float64 | 0.04 | 0.10 | 0.12 |
| 35 | POS_CASH_MONTHS_BALANCE_MIN | float64 | 0.04 | 0.10 | 0.12 |
| 36 | DAYS_ID_PUBLISH | int64 | 0.04 | 0.10 | 0.11 |
| 37 | INSTALLMENTS_PAYMENTS_IS_DELAYED_SUM | float64 | 0.04 | 0.10 | 0.10 |
| 38 | INSTALLMENTS_PAYMENTS_PAYMENT_UNDERPAYMENT_SUM | float64 | 0.04 | 0.10 | 0.10 |
| 39 | INSTALLMENTS_PAYMENTS_PERCENTAGE_DELAYED | float64 | 0.04 | 0.10 | 0.10 |
| 41 | PREVIOUS_APPLICATION_NAME_CLIENT_TYPE_NEW | float64 | 0.04 | 0.10 | 0.09 |
| 42 | PREVIOUS_APPLICATION_RATIO_APP_TO_GET_MAX | float64 | 0.04 | 0.10 | 0.08 |
| 43 | AMT_CREDIT | float64 | 0.04 | 0.10 | 0.04 |
| 44 | CREDIT_CARD_BALANCE_AMT_BALANCE_MEAN | float64 | 0.04 | 0.10 | 0.04 |
| 45 | CREDIT_CARD_BALANCE_AMT_DRAWINGS_CURRENT_STD | float64 | 0.04 | 0.10 | 0.04 |
| 46 | CREDIT_CARD_BALANCE_AMT_TOTAL_RECEIVABLE_MEAN | float64 | 0.04 | 0.10 | 0.04 |
| 47 | CREDIT_CARD_BALANCE_CNT_DRAWINGS_CURRENT_SUM | float64 | 0.04 | 0.10 | 0.04 |
| 48 | CREDIT_CARD_BALANCE_CREDIT_USE_RATE_MAX | float64 | 0.04 | 0.10 | 0.04 |
| 49 | CREDIT_CARD_BALANCE_AMT_PAYMENT_TOTAL_CURRENT_SUM | float64 | 0.04 | 0.10 | 0.03 |
| 50 | CREDIT_CARD_BALANCE_CNT_INSTALMENT_MATURE_CUM_MAX | float64 | 0.04 | 0.10 | 0.03 |
| 51 | BUREAU_DAYS_CREDIT_ENDDATE_MAX | float64 | 0.04 | 0.10 | 0.02 |
| 52 | BUREAU_AMT_CREDIT_SUM_LIMIT_SUM | float64 | 0.04 | 0.10 | 0.01 |
| 53 | PREVIOUS_APPLICATION_CODE_REJECT_REASON_HC | float64 | 0.04 | 0.09 | 0.09 |
| 54 | PREVIOUS_APPLICATION_CODE_REJECT_REASON_SCOFR | float64 | 0.04 | 0.09 | 0.05 |
| 55 | POS_CASH_MONTHS_BALANCE_MEAN | float64 | 0.03 | 0.09 | 0.09 |
| 56 | INSTALLMENTS_PAYMENTS_PAYMENT_UNDERPAYMENT_MAX | float64 | 0.03 | 0.09 | 0.09 |
| 57 | REG_CITY_NOT_WORK_CITY | int64 | 0.03 | 0.09 | 0.08 |
| 58 | PREVIOUS_APPLICATION_NAME_CONTRACT_TYPE_REVOLVING LOANS | float64 | 0.03 | 0.09 | 0.08 |
| 59 | FLAG_EMP_PHONE | int64 | 0.03 | 0.09 | 0.07 |
| 60 | POS_CASH_APP_COUNT | float64 | 0.03 | 0.09 | 0.07 |
| 61 | PREVIOUS_APPLICATION_AMT_ANNUITY_MIN | float64 | 0.03 | 0.09 | 0.06 |
| 62 | PREVIOUS_APPLICATION_RATIO_APP_TO_GET_MEAN | float64 | 0.03 | 0.09 | 0.06 |
| 63 | PREVIOUS_APPLICATION_CODE_REJECT_REASON_LIMIT | float64 | 0.03 | 0.09 | 0.06 |
| 64 | CREDIT_CARD_BALANCE_AMT_BALANCE_MAX | float64 | 0.03 | 0.09 | 0.04 |
| 65 | CREDIT_CARD_BALANCE_AMT_BALANCE_STD | float64 | 0.03 | 0.09 | 0.04 |
| 66 | CREDIT_CARD_BALANCE_AMT_DRAWINGS_CURRENT_MAX | float64 | 0.03 | 0.09 | 0.04 |
| 67 | CREDIT_CARD_BALANCE_AMT_TOTAL_RECEIVABLE_MAX | float64 | 0.03 | 0.09 | 0.04 |
| 68 | CREDIT_CARD_BALANCE_AMT_BALANCE_SUM | float64 | 0.03 | 0.09 | 0.03 |
| 69 | CREDIT_CARD_BALANCE_AMT_BALANCE_MIN | float64 | 0.03 | 0.09 | 0.03 |
| 70 | CREDIT_CARD_BALANCE_AMT_DRAWINGS_CURRENT_SUM | float64 | 0.03 | 0.09 | 0.03 |
| 71 | CREDIT_CARD_BALANCE_AMT_PAYMENT_TOTAL_CURRENT_MAX | float64 | 0.03 | 0.09 | 0.03 |
| 72 | CREDIT_CARD_BALANCE_AMT_TOTAL_RECEIVABLE_SUM | float64 | 0.03 | 0.09 | 0.03 |
| 73 | CREDIT_CARD_BALANCE_AMT_TOTAL_RECEIVABLE_MIN | float64 | 0.03 | 0.09 | 0.03 |
| 74 | PREVIOUS_APPLICATION_NAME_YIELD_GROUP_HIGH | float64 | 0.03 | 0.08 | 0.09 |
| 75 | DAYS_REGISTRATION | float64 | 0.03 | 0.08 | 0.08 |
| 76 | PREVIOUS_APPLICATION_NAME_YIELD_GROUP_XNA | float64 | 0.03 | 0.08 | 0.08 |
| 77 | FLAG_DOCUMENT_3 | int64 | 0.03 | 0.08 | 0.07 |
| 78 | POS_CASH_NAME_CONTRACT_STATUS_ACTIVE | float64 | 0.03 | 0.08 | 0.07 |
| 79 | REGION_POPULATION_RELATIVE | float64 | 0.03 | 0.08 | 0.06 |
| 80 | POS_CASH_SK_DPD_DEF_MEAN | float64 | 0.03 | 0.08 | 0.06 |
| 81 | PREVIOUS_APPLICATION_NAME_YIELD_GROUP_LOW_NORMAL | float64 | 0.03 | 0.08 | 0.05 |
| 83 | CREDIT_CARD_BALANCE_AMT_TOTAL_RECEIVABLE_STD | float64 | 0.03 | 0.08 | 0.04 |
| 84 | CREDIT_CARD_BALANCE_AMT_PAYMENT_TOTAL_CURRENT_MEAN | float64 | 0.03 | 0.08 | 0.03 |
| 85 | AMT_ANNUITY | float64 | 0.03 | 0.08 | 0.00 |
| 86 | BUREAU_AMT_CREDIT_SUM_MEAN | float64 | 0.02 | 0.08 | 0.07 |
| 87 | POS_CASH_SK_DPD_DEF_MAX | float64 | 0.02 | 0.08 | 0.06 |
| 88 | REG_CITY_NOT_LIVE_CITY | int64 | 0.02 | 0.08 | 0.05 |
| 89 | PREVIOUS_APPLICATION_HOUR_APPR_PROCESS_START_MEAN | float64 | 0.02 | 0.08 | 0.05 |
| 90 | PREVIOUS_APPLICATION_HOUR_APPR_PROCESS_START_MIN | float64 | 0.02 | 0.08 | 0.05 |
| 91 | PREVIOUS_APPLICATION_RATIO_APP_TO_GET_MIN | float64 | 0.02 | 0.08 | 0.03 |
| 92 | INSTALLMENTS_PAYMENTS_PAYMENT_CHANGE_MAX | float64 | 0.02 | 0.08 | 0.02 |
| 93 | BUREAU_AMT_CREDIT_SUM_SUM | float64 | 0.02 | 0.07 | 0.07 |
| 94 | POS_CASH_SK_DPD_MAX | float64 | 0.02 | 0.07 | 0.06 |
| 95 | POS_CASH_SK_DPD_MEAN | float64 | 0.02 | 0.07 | 0.06 |
| 96 | BUREAU_BALANCE_STATUS_C | float64 | 0.02 | 0.07 | 0.05 |
| 97 | POS_CASH_CNT_INSTALMENT_FUTURE_MEAN | float64 | 0.02 | 0.07 | 0.05 |
| 98 | POS_CASH_NAME_CONTRACT_STATUS_COMPLETED | float64 | 0.02 | 0.07 | 0.05 |
| 99 | PREVIOUS_APPLICATION_AMT_ANNUITY_MEAN | float64 | 0.02 | 0.07 | 0.04 |
| 100 | PREVIOUS_APPLICATION_AMT_GOODS_PRICE_MIN | float64 | 0.02 | 0.07 | 0.04 |
| 101 | PREVIOUS_APPLICATION_HOUR_APPR_PROCESS_START_MAX | float64 | 0.02 | 0.07 | 0.04 |
| 102 | PREVIOUS_APPLICATION_NAME_CONTRACT_STATUS_APPROVED | float64 | 0.02 | 0.07 | 0.04 |
| 104 | BUREAU_CREDIT_DAY_OVERDUE_MEAN | float64 | 0.02 | 0.07 | 0.03 |
| 105 | BUREAU_CREDIT_DAY_OVERDUE_SUM | float64 | 0.02 | 0.07 | 0.03 |
| 106 | BUREAU_CREDIT_DAY_OVERDUE_MAX | float64 | 0.02 | 0.07 | 0.03 |
| 107 | BUREAU_AMT_CREDIT_SUM_OVERDUE_MEAN | float64 | 0.02 | 0.07 | 0.03 |
| 108 | BUREAU_AMT_CREDIT_SUM_OVERDUE_SUM | float64 | 0.02 | 0.07 | 0.03 |
| 109 | PREVIOUS_APPLICATION_AMT_APPLICATION_MEAN | float64 | 0.02 | 0.07 | 0.03 |
| 110 | PREVIOUS_APPLICATION_NAME_CLIENT_TYPE_REFRESHED | float64 | 0.02 | 0.07 | 0.03 |
| 111 | PREVIOUS_APPLICATION_NAME_YIELD_GROUP_LOW_ACTION | float64 | 0.02 | 0.07 | 0.03 |
| 112 | CREDIT_CARD_BALANCE_MONTHS_BALANCE_MIN | float64 | 0.02 | 0.07 | 0.03 |
| 113 | CREDIT_CARD_BALANCE_MONTHS_BALANCE_MEAN | float64 | 0.02 | 0.07 | 0.03 |
| 114 | AMT_REQ_CREDIT_BUREAU_YEAR | float64 | 0.02 | 0.07 | 0.01 |
| 115 | BUREAU_BALANCE_STATUS_1 | float64 | 0.02 | 0.07 | 0.01 |
| 116 | PREVIOUS_APPLICATION_AMT_APPLICATION_MAX | float64 | 0.02 | 0.07 | 0.01 |
| 117 | PREVIOUS_APPLICATION_AMT_GOODS_PRICE_MAX | float64 | 0.02 | 0.07 | 0.01 |
| 118 | CREDIT_CARD_BALANCE_STATUS_Active | float64 | 0.02 | 0.07 | 0.01 |
| 119 | PREVIOUS_APPLICATION_AMT_CREDIT_MAX | float64 | 0.02 | 0.07 | 0.00 |
| 120 | PREVIOUS_APPLICATION_DAYS_DECISION_SUM | float64 | 0.02 | 0.06 | 0.07 |
| 121 | POS_CASH_CNT_INSTALMENT_MEAN | float64 | 0.02 | 0.06 | 0.04 |
| 122 | INSTALLMENTS_PAYMENTS_PAYMENT_CHANGE_MEAN | float64 | 0.02 | 0.06 | 0.04 |
| 123 | BUREAU_BALANCE_STATUS_0 | float64 | 0.02 | 0.06 | 0.03 |
| 124 | BUREAU_BALANCE_STATUS_X | float64 | 0.02 | 0.06 | 0.03 |
| 125 | PREVIOUS_APPLICATION_AMT_ANNUITY_MAX | float64 | 0.02 | 0.06 | 0.03 |
| 126 | PREVIOUS_APPLICATION_AMT_CREDIT_MIN | float64 | 0.02 | 0.06 | 0.03 |
| 127 | PREVIOUS_APPLICATION_AMT_APPLICATION_MIN | float64 | 0.02 | 0.06 | 0.02 |
| 128 | PREVIOUS_APPLICATION_AMT_GOODS_PRICE_MEAN | float64 | 0.02 | 0.06 | 0.02 |
| 129 | PREVIOUS_APPLICATION_DAYS_DECISION_MAX | float64 | 0.01 | 0.06 | 0.06 |
| 130 | LIVE_CITY_NOT_WORK_CITY | int64 | 0.01 | 0.06 | 0.05 |
| 131 | POS_CASH_CNT_INSTALMENT_MIN | float64 | 0.01 | 0.06 | 0.05 |
| 132 | DEF_30_CNT_SOCIAL_CIRCLE | float64 | 0.01 | 0.06 | 0.04 |
| 133 | BUREAU_BALANCE_STATUS_3 | float64 | 0.01 | 0.06 | 0.04 |
| 134 | BUREAU_BALANCE_STATUS_4 | float64 | 0.01 | 0.06 | 0.04 |
| 135 | BUREAU_BALANCE_STATUS_5 | float64 | 0.01 | 0.06 | 0.04 |
| 136 | FLAG_DOCUMENT_6 | int64 | 0.01 | 0.06 | 0.03 |
| 137 | BUREAU_BALANCE_STATUS_2 | float64 | 0.01 | 0.06 | 0.03 |
| 138 | PREVIOUS_APPLICATION_NAME_CONTRACT_TYPE_CONSUMER LOANS | float64 | 0.01 | 0.06 | 0.03 |
| 141 | PREVIOUS_APPLICATION_AMT_ANNUITY_SUM | float64 | 0.01 | 0.06 | 0.02 |
| 142 | PREVIOUS_APPLICATION_AMT_CREDIT_MEAN | float64 | 0.01 | 0.06 | 0.02 |
| 143 | POS_CASH_CNT_INSTALMENT_MAX | float64 | 0.01 | 0.06 | 0.02 |
| 144 | POS_CASH_CNT_INSTALMENT_FUTURE_MAX | float64 | 0.01 | 0.06 | 0.02 |
| 145 | INSTALLMENTS_PAYMENTS_NUM_INSTALMENT_VERSION_NUNIQUE | float64 | 0.01 | 0.06 | 0.02 |
| 146 | CREDIT_CARD_BALANCE_CNT_INSTALMENT_MATURE_CUM_MIN | float64 | 0.01 | 0.06 | 0.02 |
| 147 | PREVIOUS_APPLICATION_AMT_CREDIT_SUM | float64 | 0.01 | 0.06 | 0.01 |
| 148 | CREDIT_CARD_BALANCE_AMT_CREDIT_LIMIT_ACTUAL_SUM | float64 | 0.01 | 0.06 | 0.01 |
| 149 | PREVIOUS_APPLICATION_AMT_APPLICATION_SUM | float64 | 0.01 | 0.06 | 0.00 |
| 150 | PREVIOUS_APPLICATION_AMT_GOODS_PRICE_SUM | float64 | 0.01 | 0.06 | 0.00 |
| 151 | HOUR_APPR_PROCESS_START | int64 | 0.01 | 0.05 | 0.05 |
| 152 | PREVIOUS_APPLICATION_NAME_CONTRACT_TYPE_CASH LOANS | float64 | 0.01 | 0.05 | 0.05 |
| 153 | AMT_INCOME_TOTAL | float64 | 0.01 | 0.05 | 0.04 |
| 154 | FLAG_WORK_PHONE | int64 | 0.01 | 0.05 | 0.04 |
| 155 | BUREAU_CNT_CREDIT_PROLONG_MEAN | float64 | 0.01 | 0.05 | 0.04 |
| 156 | BUREAU_CNT_CREDIT_PROLONG_SUM | float64 | 0.01 | 0.05 | 0.04 |
| 157 | PREVIOUS_APPLICATION_NAME_CONTRACT_STATUS_CANCELED | float64 | 0.01 | 0.05 | 0.04 |
| 159 | DEF_60_CNT_SOCIAL_CIRCLE | float64 | 0.01 | 0.05 | 0.03 |
| 160 | PREVIOUS_APPLICATION_WEEKDAY_APPR_PROCESS_START_FRIDAY | float64 | 0.01 | 0.05 | 0.03 |
| 161 | PREVIOUS_APPLICATION_WEEKDAY_APPR_PROCESS_START_MONDAY | float64 | 0.01 | 0.05 | 0.03 |
| 162 | PREVIOUS_APPLICATION_WEEKDAY_APPR_PROCESS_START_THURSDAY | float64 | 0.01 | 0.05 | 0.03 |
| 163 | PREVIOUS_APPLICATION_WEEKDAY_APPR_PROCESS_START_TUESDAY | float64 | 0.01 | 0.05 | 0.03 |
| 164 | PREVIOUS_APPLICATION_WEEKDAY_APPR_PROCESS_START_WEDNESDAY | float64 | 0.01 | 0.05 | 0.03 |
| 165 | PREVIOUS_APPLICATION_NAME_CLIENT_TYPE_REPEATER | float64 | 0.01 | 0.05 | 0.03 |
| 166 | PREVIOUS_APPLICATION_CODE_REJECT_REASON_SCO | float64 | 0.01 | 0.05 | 0.03 |
| 167 | POS_CASH_CNT_INSTALMENT_SUM | float64 | 0.01 | 0.05 | 0.03 |
| 168 | PREVIOUS_APPLICATION_CODE_REJECT_REASON_XAP | float64 | 0.01 | 0.05 | 0.02 |
| 169 | POS_CASH_CNT_INSTALMENT_FUTURE_SUM | float64 | 0.01 | 0.05 | 0.02 |
| 170 | PREVIOUS_APPLICATION_WEEKDAY_APPR_PROCESS_START_SATURDAY | float64 | 0.01 | 0.05 | 0.01 |
| 171 | PREVIOUS_APPLICATION_NAME_PRODUCT_TYPE_XNA | float64 | 0.01 | 0.05 | 0.01 |
| 172 | POS_CASH_MONTHS_BALANCE_MAX | float64 | 0.01 | 0.05 | 0.01 |
| 173 | PREVIOUS_APPLICATION_NAME_YIELD_GROUP_MIDDLE | float64 | 0.01 | 0.05 | 0.00 |
numeric_cols.shape  # Matches: 162 numeric columns were expected and 162 are present
(162, 6)
# Define the conditions (all variables with IV <= 0.05 and, at the same time, Gini <= 0.05 were reviewed, and these were adopted as the removal criteria)
condition = (numeric_cols['IV'] <= 0.05) & (numeric_cols['Gini'] <= 0.05)
# Filter the DataFrame to find the records that meet both conditions
filtered_rows = numeric_cols[condition]
filtered_rows[['Zmienna', 'Data Type', 'IV', 'V Cramera', 'Gini']]
| | Zmienna | Data Type | IV | V Cramera | Gini |
|---|---|---|---|---|---|
| 21 | BUREAU_AMT_CREDIT_SUM_DEBT_MEAN | float64 | 0.05 | 0.12 | 0.04 |
| 28 | BUREAU_DAYS_CREDIT_UPDATE_MAX | float64 | 0.05 | 0.11 | 0.04 |
| 29 | CREDIT_CARD_BALANCE_AMT_DRAWINGS_CURRENT_MEAN | float64 | 0.05 | 0.11 | 0.04 |
| 30 | CREDIT_CARD_BALANCE_CNT_DRAWINGS_CURRENT_MAX | float64 | 0.05 | 0.11 | 0.04 |
| 31 | BUREAU_AMT_CREDIT_SUM_DEBT_SUM | float64 | 0.05 | 0.11 | 0.03 |
| 32 | CREDIT_CARD_BALANCE_CNT_INSTALMENT_MATURE_CUM_MEAN | float64 | 0.05 | 0.11 | 0.02 |
| 43 | AMT_CREDIT | float64 | 0.04 | 0.10 | 0.04 |
| 44 | CREDIT_CARD_BALANCE_AMT_BALANCE_MEAN | float64 | 0.04 | 0.10 | 0.04 |
| 45 | CREDIT_CARD_BALANCE_AMT_DRAWINGS_CURRENT_STD | float64 | 0.04 | 0.10 | 0.04 |
| 46 | CREDIT_CARD_BALANCE_AMT_TOTAL_RECEIVABLE_MEAN | float64 | 0.04 | 0.10 | 0.04 |
| 47 | CREDIT_CARD_BALANCE_CNT_DRAWINGS_CURRENT_SUM | float64 | 0.04 | 0.10 | 0.04 |
| 48 | CREDIT_CARD_BALANCE_CREDIT_USE_RATE_MAX | float64 | 0.04 | 0.10 | 0.04 |
| 49 | CREDIT_CARD_BALANCE_AMT_PAYMENT_TOTAL_CURRENT_SUM | float64 | 0.04 | 0.10 | 0.03 |
| 50 | CREDIT_CARD_BALANCE_CNT_INSTALMENT_MATURE_CUM_MAX | float64 | 0.04 | 0.10 | 0.03 |
| 51 | BUREAU_DAYS_CREDIT_ENDDATE_MAX | float64 | 0.04 | 0.10 | 0.02 |
| 52 | BUREAU_AMT_CREDIT_SUM_LIMIT_SUM | float64 | 0.04 | 0.10 | 0.01 |
| 54 | PREVIOUS_APPLICATION_CODE_REJECT_REASON_SCOFR | float64 | 0.04 | 0.09 | 0.05 |
| 64 | CREDIT_CARD_BALANCE_AMT_BALANCE_MAX | float64 | 0.03 | 0.09 | 0.04 |
| 65 | CREDIT_CARD_BALANCE_AMT_BALANCE_STD | float64 | 0.03 | 0.09 | 0.04 |
| 66 | CREDIT_CARD_BALANCE_AMT_DRAWINGS_CURRENT_MAX | float64 | 0.03 | 0.09 | 0.04 |
| 67 | CREDIT_CARD_BALANCE_AMT_TOTAL_RECEIVABLE_MAX | float64 | 0.03 | 0.09 | 0.04 |
| 68 | CREDIT_CARD_BALANCE_AMT_BALANCE_SUM | float64 | 0.03 | 0.09 | 0.03 |
| 69 | CREDIT_CARD_BALANCE_AMT_BALANCE_MIN | float64 | 0.03 | 0.09 | 0.03 |
| 70 | CREDIT_CARD_BALANCE_AMT_DRAWINGS_CURRENT_SUM | float64 | 0.03 | 0.09 | 0.03 |
| 71 | CREDIT_CARD_BALANCE_AMT_PAYMENT_TOTAL_CURRENT_MAX | float64 | 0.03 | 0.09 | 0.03 |
| 72 | CREDIT_CARD_BALANCE_AMT_TOTAL_RECEIVABLE_SUM | float64 | 0.03 | 0.09 | 0.03 |
| 73 | CREDIT_CARD_BALANCE_AMT_TOTAL_RECEIVABLE_MIN | float64 | 0.03 | 0.09 | 0.03 |
| 81 | PREVIOUS_APPLICATION_NAME_YIELD_GROUP_LOW_NORMAL | float64 | 0.03 | 0.08 | 0.05 |
| 83 | CREDIT_CARD_BALANCE_AMT_TOTAL_RECEIVABLE_STD | float64 | 0.03 | 0.08 | 0.04 |
| 84 | CREDIT_CARD_BALANCE_AMT_PAYMENT_TOTAL_CURRENT_MEAN | float64 | 0.03 | 0.08 | 0.03 |
| 85 | AMT_ANNUITY | float64 | 0.03 | 0.08 | 0.00 |
| 88 | REG_CITY_NOT_LIVE_CITY | int64 | 0.02 | 0.08 | 0.05 |
| 89 | PREVIOUS_APPLICATION_HOUR_APPR_PROCESS_START_MEAN | float64 | 0.02 | 0.08 | 0.05 |
| 90 | PREVIOUS_APPLICATION_HOUR_APPR_PROCESS_START_MIN | float64 | 0.02 | 0.08 | 0.05 |
| 91 | PREVIOUS_APPLICATION_RATIO_APP_TO_GET_MIN | float64 | 0.02 | 0.08 | 0.03 |
| 92 | INSTALLMENTS_PAYMENTS_PAYMENT_CHANGE_MAX | float64 | 0.02 | 0.08 | 0.02 |
| 96 | BUREAU_BALANCE_STATUS_C | float64 | 0.02 | 0.07 | 0.05 |
| 97 | POS_CASH_CNT_INSTALMENT_FUTURE_MEAN | float64 | 0.02 | 0.07 | 0.05 |
| 98 | POS_CASH_NAME_CONTRACT_STATUS_COMPLETED | float64 | 0.02 | 0.07 | 0.05 |
| 99 | PREVIOUS_APPLICATION_AMT_ANNUITY_MEAN | float64 | 0.02 | 0.07 | 0.04 |
| 100 | PREVIOUS_APPLICATION_AMT_GOODS_PRICE_MIN | float64 | 0.02 | 0.07 | 0.04 |
| 101 | PREVIOUS_APPLICATION_HOUR_APPR_PROCESS_START_MAX | float64 | 0.02 | 0.07 | 0.04 |
| 102 | PREVIOUS_APPLICATION_NAME_CONTRACT_STATUS_APPROVED | float64 | 0.02 | 0.07 | 0.04 |
| 104 | BUREAU_CREDIT_DAY_OVERDUE_MEAN | float64 | 0.02 | 0.07 | 0.03 |
| 105 | BUREAU_CREDIT_DAY_OVERDUE_SUM | float64 | 0.02 | 0.07 | 0.03 |
| 106 | BUREAU_CREDIT_DAY_OVERDUE_MAX | float64 | 0.02 | 0.07 | 0.03 |
| 107 | BUREAU_AMT_CREDIT_SUM_OVERDUE_MEAN | float64 | 0.02 | 0.07 | 0.03 |
| 108 | BUREAU_AMT_CREDIT_SUM_OVERDUE_SUM | float64 | 0.02 | 0.07 | 0.03 |
| 109 | PREVIOUS_APPLICATION_AMT_APPLICATION_MEAN | float64 | 0.02 | 0.07 | 0.03 |
| 110 | PREVIOUS_APPLICATION_NAME_CLIENT_TYPE_REFRESHED | float64 | 0.02 | 0.07 | 0.03 |
| 111 | PREVIOUS_APPLICATION_NAME_YIELD_GROUP_LOW_ACTION | float64 | 0.02 | 0.07 | 0.03 |
| 112 | CREDIT_CARD_BALANCE_MONTHS_BALANCE_MIN | float64 | 0.02 | 0.07 | 0.03 |
| 113 | CREDIT_CARD_BALANCE_MONTHS_BALANCE_MEAN | float64 | 0.02 | 0.07 | 0.03 |
| 114 | AMT_REQ_CREDIT_BUREAU_YEAR | float64 | 0.02 | 0.07 | 0.01 |
| 115 | BUREAU_BALANCE_STATUS_1 | float64 | 0.02 | 0.07 | 0.01 |
| 116 | PREVIOUS_APPLICATION_AMT_APPLICATION_MAX | float64 | 0.02 | 0.07 | 0.01 |
| 117 | PREVIOUS_APPLICATION_AMT_GOODS_PRICE_MAX | float64 | 0.02 | 0.07 | 0.01 |
| 118 | CREDIT_CARD_BALANCE_STATUS_Active | float64 | 0.02 | 0.07 | 0.01 |
| 119 | PREVIOUS_APPLICATION_AMT_CREDIT_MAX | float64 | 0.02 | 0.07 | 0.00 |
| 121 | POS_CASH_CNT_INSTALMENT_MEAN | float64 | 0.02 | 0.06 | 0.04 |
| 122 | INSTALLMENTS_PAYMENTS_PAYMENT_CHANGE_MEAN | float64 | 0.02 | 0.06 | 0.04 |
| 123 | BUREAU_BALANCE_STATUS_0 | float64 | 0.02 | 0.06 | 0.03 |
| 124 | BUREAU_BALANCE_STATUS_X | float64 | 0.02 | 0.06 | 0.03 |
| 125 | PREVIOUS_APPLICATION_AMT_ANNUITY_MAX | float64 | 0.02 | 0.06 | 0.03 |
| 126 | PREVIOUS_APPLICATION_AMT_CREDIT_MIN | float64 | 0.02 | 0.06 | 0.03 |
| 127 | PREVIOUS_APPLICATION_AMT_APPLICATION_MIN | float64 | 0.02 | 0.06 | 0.02 |
| 128 | PREVIOUS_APPLICATION_AMT_GOODS_PRICE_MEAN | float64 | 0.02 | 0.06 | 0.02 |
| 130 | LIVE_CITY_NOT_WORK_CITY | int64 | 0.01 | 0.06 | 0.05 |
| 131 | POS_CASH_CNT_INSTALMENT_MIN | float64 | 0.01 | 0.06 | 0.05 |
| 132 | DEF_30_CNT_SOCIAL_CIRCLE | float64 | 0.01 | 0.06 | 0.04 |
| 133 | BUREAU_BALANCE_STATUS_3 | float64 | 0.01 | 0.06 | 0.04 |
| 134 | BUREAU_BALANCE_STATUS_4 | float64 | 0.01 | 0.06 | 0.04 |
| 135 | BUREAU_BALANCE_STATUS_5 | float64 | 0.01 | 0.06 | 0.04 |
| 136 | FLAG_DOCUMENT_6 | int64 | 0.01 | 0.06 | 0.03 |
| 137 | BUREAU_BALANCE_STATUS_2 | float64 | 0.01 | 0.06 | 0.03 |
| 138 | PREVIOUS_APPLICATION_NAME_CONTRACT_TYPE_CONSUMER LOANS | float64 | 0.01 | 0.06 | 0.03 |
| 141 | PREVIOUS_APPLICATION_AMT_ANNUITY_SUM | float64 | 0.01 | 0.06 | 0.02 |
| 142 | PREVIOUS_APPLICATION_AMT_CREDIT_MEAN | float64 | 0.01 | 0.06 | 0.02 |
| 143 | POS_CASH_CNT_INSTALMENT_MAX | float64 | 0.01 | 0.06 | 0.02 |
| 144 | POS_CASH_CNT_INSTALMENT_FUTURE_MAX | float64 | 0.01 | 0.06 | 0.02 |
| 145 | INSTALLMENTS_PAYMENTS_NUM_INSTALMENT_VERSION_NUNIQUE | float64 | 0.01 | 0.06 | 0.02 |
| 146 | CREDIT_CARD_BALANCE_CNT_INSTALMENT_MATURE_CUM_MIN | float64 | 0.01 | 0.06 | 0.02 |
| 147 | PREVIOUS_APPLICATION_AMT_CREDIT_SUM | float64 | 0.01 | 0.06 | 0.01 |
| 148 | CREDIT_CARD_BALANCE_AMT_CREDIT_LIMIT_ACTUAL_SUM | float64 | 0.01 | 0.06 | 0.01 |
| 149 | PREVIOUS_APPLICATION_AMT_APPLICATION_SUM | float64 | 0.01 | 0.06 | 0.00 |
| 150 | PREVIOUS_APPLICATION_AMT_GOODS_PRICE_SUM | float64 | 0.01 | 0.06 | 0.00 |
| 151 | HOUR_APPR_PROCESS_START | int64 | 0.01 | 0.05 | 0.05 |
| 152 | PREVIOUS_APPLICATION_NAME_CONTRACT_TYPE_CASH LOANS | float64 | 0.01 | 0.05 | 0.05 |
| 153 | AMT_INCOME_TOTAL | float64 | 0.01 | 0.05 | 0.04 |
| 154 | FLAG_WORK_PHONE | int64 | 0.01 | 0.05 | 0.04 |
| 155 | BUREAU_CNT_CREDIT_PROLONG_MEAN | float64 | 0.01 | 0.05 | 0.04 |
| 156 | BUREAU_CNT_CREDIT_PROLONG_SUM | float64 | 0.01 | 0.05 | 0.04 |
| 157 | PREVIOUS_APPLICATION_NAME_CONTRACT_STATUS_CANCELED | float64 | 0.01 | 0.05 | 0.04 |
| 159 | DEF_60_CNT_SOCIAL_CIRCLE | float64 | 0.01 | 0.05 | 0.03 |
| 160 | PREVIOUS_APPLICATION_WEEKDAY_APPR_PROCESS_START_FRIDAY | float64 | 0.01 | 0.05 | 0.03 |
| 161 | PREVIOUS_APPLICATION_WEEKDAY_APPR_PROCESS_START_MONDAY | float64 | 0.01 | 0.05 | 0.03 |
| 162 | PREVIOUS_APPLICATION_WEEKDAY_APPR_PROCESS_START_THURSDAY | float64 | 0.01 | 0.05 | 0.03 |
| 163 | PREVIOUS_APPLICATION_WEEKDAY_APPR_PROCESS_START_TUESDAY | float64 | 0.01 | 0.05 | 0.03 |
| 164 | PREVIOUS_APPLICATION_WEEKDAY_APPR_PROCESS_START_WEDNESDAY | float64 | 0.01 | 0.05 | 0.03 |
| 165 | PREVIOUS_APPLICATION_NAME_CLIENT_TYPE_REPEATER | float64 | 0.01 | 0.05 | 0.03 |
| 166 | PREVIOUS_APPLICATION_CODE_REJECT_REASON_SCO | float64 | 0.01 | 0.05 | 0.03 |
| 167 | POS_CASH_CNT_INSTALMENT_SUM | float64 | 0.01 | 0.05 | 0.03 |
| 168 | PREVIOUS_APPLICATION_CODE_REJECT_REASON_XAP | float64 | 0.01 | 0.05 | 0.02 |
| 169 | POS_CASH_CNT_INSTALMENT_FUTURE_SUM | float64 | 0.01 | 0.05 | 0.02 |
| 170 | PREVIOUS_APPLICATION_WEEKDAY_APPR_PROCESS_START_SATURDAY | float64 | 0.01 | 0.05 | 0.01 |
| 171 | PREVIOUS_APPLICATION_NAME_PRODUCT_TYPE_XNA | float64 | 0.01 | 0.05 | 0.01 |
| 172 | POS_CASH_MONTHS_BALANCE_MAX | float64 | 0.01 | 0.05 | 0.01 |
| 173 | PREVIOUS_APPLICATION_NAME_YIELD_GROUP_MIDDLE | float64 | 0.01 | 0.05 | 0.00 |
var_to_drop = filtered_rows['Zmienna'].tolist()
len(var_to_drop)
108
df_train.drop(columns=var_to_drop, inplace=True)  # drop all 108 of these columns
df_test.drop(columns=var_to_drop, inplace=True)
df_all.drop(columns=var_to_drop, inplace=True)
df_train.shape  # 72 columns remain, as expected
(452394, 72)
df_test.shape
(61503, 72)
df_all.shape
(513897, 73)
Examining the IV indicator¶
I decide to remove all variables whose IV is below 0.02, even if their Gini exceeds 0.05.
condition = (ranking_predyktorow['IV'] < 0.02)
filtered_rows = ranking_predyktorow[condition]
var_to_drop = filtered_rows['Zmienna'].tolist()
var_to_drop
['PREVIOUS_APPLICATION_DAYS_DECISION_MAX', 'LIVE_CITY_NOT_WORK_CITY', 'POS_CASH_CNT_INSTALMENT_MIN', 'DEF_30_CNT_SOCIAL_CIRCLE', 'BUREAU_BALANCE_STATUS_3', 'BUREAU_BALANCE_STATUS_4', 'BUREAU_BALANCE_STATUS_5', 'FLAG_DOCUMENT_6', 'BUREAU_BALANCE_STATUS_2', 'PREVIOUS_APPLICATION_NAME_CONTRACT_TYPE_CONSUMER LOANS', 'NAME_CONTRACT_TYPE', 'NAME_HOUSING_TYPE', 'PREVIOUS_APPLICATION_AMT_ANNUITY_SUM', 'PREVIOUS_APPLICATION_AMT_CREDIT_MEAN', 'POS_CASH_CNT_INSTALMENT_MAX', 'POS_CASH_CNT_INSTALMENT_FUTURE_MAX', 'INSTALLMENTS_PAYMENTS_NUM_INSTALMENT_VERSION_NUNIQUE', 'CREDIT_CARD_BALANCE_CNT_INSTALMENT_MATURE_CUM_MIN', 'PREVIOUS_APPLICATION_AMT_CREDIT_SUM', 'CREDIT_CARD_BALANCE_AMT_CREDIT_LIMIT_ACTUAL_SUM', 'PREVIOUS_APPLICATION_AMT_APPLICATION_SUM', 'PREVIOUS_APPLICATION_AMT_GOODS_PRICE_SUM', 'PREVIOUS_APPLICATION_MOST_FREQ_PREVIOUS_APPLICATION_NAME_CASH_LOAN_PURPOSE', 'HOUR_APPR_PROCESS_START', 'PREVIOUS_APPLICATION_NAME_CONTRACT_TYPE_CASH LOANS', 'AMT_INCOME_TOTAL', 'FLAG_WORK_PHONE', 'BUREAU_CNT_CREDIT_PROLONG_MEAN', 'BUREAU_CNT_CREDIT_PROLONG_SUM', 'PREVIOUS_APPLICATION_NAME_CONTRACT_STATUS_CANCELED', 'BUREAU_MOST_FREQ_CREDIT_CURRENCY', 'DEF_60_CNT_SOCIAL_CIRCLE', 'PREVIOUS_APPLICATION_WEEKDAY_APPR_PROCESS_START_FRIDAY', 'PREVIOUS_APPLICATION_WEEKDAY_APPR_PROCESS_START_MONDAY', 'PREVIOUS_APPLICATION_WEEKDAY_APPR_PROCESS_START_THURSDAY', 'PREVIOUS_APPLICATION_WEEKDAY_APPR_PROCESS_START_TUESDAY', 'PREVIOUS_APPLICATION_WEEKDAY_APPR_PROCESS_START_WEDNESDAY', 'PREVIOUS_APPLICATION_NAME_CLIENT_TYPE_REPEATER', 'PREVIOUS_APPLICATION_CODE_REJECT_REASON_SCO', 'POS_CASH_CNT_INSTALMENT_SUM', 'PREVIOUS_APPLICATION_CODE_REJECT_REASON_XAP', 'POS_CASH_CNT_INSTALMENT_FUTURE_SUM', 'PREVIOUS_APPLICATION_WEEKDAY_APPR_PROCESS_START_SATURDAY', 'PREVIOUS_APPLICATION_NAME_PRODUCT_TYPE_XNA', 'POS_CASH_MONTHS_BALANCE_MAX', 'PREVIOUS_APPLICATION_NAME_YIELD_GROUP_MIDDLE', 'CNT_CHILDREN', 'FLAG_PHONE', 'FLAG_OWN_CAR', 'POS_CASH_CNT_INSTALMENT_FUTURE_MIN', 'CREDIT_CARD_BALANCE_MONTHS_BALANCE_MAX', 
'CREDIT_CARD_BALANCE_AMT_CREDIT_LIMIT_ACTUAL_MIN', 'CNT_FAM_MEMBERS', 'PREVIOUS_APPLICATION_NAME_CONTRACT_TYPE_XNA', 'PREVIOUS_APPLICATION_NAME_CONTRACT_STATUS_UNUSED OFFER', 'PREVIOUS_APPLICATION_NAME_CLIENT_TYPE_XNA', 'PREVIOUS_APPLICATION_CODE_REJECT_REASON_CLIENT', 'PREVIOUS_APPLICATION_CODE_REJECT_REASON_SYSTEM', 'PREVIOUS_APPLICATION_CODE_REJECT_REASON_VERIF', 'PREVIOUS_APPLICATION_CODE_REJECT_REASON_XNA', 'CREDIT_CARD_BALANCE_AMT_DRAWINGS_CURRENT_MIN', 'CREDIT_CARD_BALANCE_CNT_DRAWINGS_CURRENT_MIN', 'CREDIT_CARD_BALANCE_SK_DPD_MAX', 'CREDIT_CARD_BALANCE_SK_DPD_MEAN', 'CREDIT_CARD_BALANCE_SK_DPD_DEF_MAX', 'CREDIT_CARD_BALANCE_STATUS_Completed', 'PREVIOUS_APPLICATION_PREV_APPS_COUNT', 'PREVIOUS_APPLICATION_WEEKDAY_APPR_PROCESS_START_SUNDAY', 'PREVIOUS_APPLICATION_NAME_PRODUCT_TYPE_X-SELL', 'CREDIT_CARD_BALANCE_SK_DPD_DEF_SUM', 'OBS_30_CNT_SOCIAL_CIRCLE', 'OBS_60_CNT_SOCIAL_CIRCLE', 'POS_CASH_NAME_CONTRACT_STATUS_RETURNED TO THE STORE', 'CREDIT_CARD_BALANCE_AMT_CREDIT_LIMIT_ACTUAL_MEAN', 'CREDIT_CARD_BALANCE_AMT_CREDIT_LIMIT_ACTUAL_MAX', 'CREDIT_CARD_BALANCE_AMT_CREDIT_LIMIT_ACTUAL_STD', 'CREDIT_CARD_BALANCE_AMT_PAYMENT_TOTAL_CURRENT_MIN', 'CREDIT_CARD_BALANCE_SK_DPD_SUM', 'CREDIT_CARD_BALANCE_SK_DPD_DEF_MEAN', 'CREDIT_CARD_BALANCE_STATUS_Approved', 'CREDIT_CARD_BALANCE_STATUS_Demand', 'CREDIT_CARD_BALANCE_STATUS_Refused', 'CREDIT_CARD_BALANCE_STATUS_Sent proposal', 'CREDIT_CARD_BALANCE_STATUS_Signed', 'POS_CASH_NAME_CONTRACT_STATUS_AMORTIZED DEBT', 'POS_CASH_NAME_CONTRACT_STATUS_APPROVED', 'POS_CASH_NAME_CONTRACT_STATUS_CANCELED', 'POS_CASH_NAME_CONTRACT_STATUS_DEMAND', 'POS_CASH_NAME_CONTRACT_STATUS_SIGNED', 'POS_CASH_NAME_CONTRACT_STATUS_XNA', 'NAME_TYPE_SUITE', 'FLAG_DOCUMENT_8', 'FLAG_DOCUMENT_13', 'FLAG_DOCUMENT_14', 'FLAG_DOCUMENT_16', 'REG_REGION_NOT_WORK_REGION', 'FLAG_OWN_REALTY', 'REG_REGION_NOT_LIVE_REGION', 'FLAG_DOCUMENT_4', 'FLAG_DOCUMENT_9', 'FLAG_DOCUMENT_11', 'FLAG_DOCUMENT_15', 'FLAG_DOCUMENT_17', 'FLAG_DOCUMENT_18', 'FLAG_DOCUMENT_21', 
'WEEKDAY_APPR_PROCESS_START', 'FLAG_MOBIL', 'FLAG_CONT_MOBILE', 'FLAG_EMAIL', 'LIVE_REGION_NOT_WORK_REGION', 'FLAG_DOCUMENT_2', 'FLAG_DOCUMENT_5', 'FLAG_DOCUMENT_7', 'FLAG_DOCUMENT_10', 'FLAG_DOCUMENT_19', 'FLAG_DOCUMENT_20']
I remove only the columns that were not already removed earlier.
# Drop the columns from df_train if they are present
df_train.drop(columns=[col for col in var_to_drop if col in df_train.columns], inplace=True)
# Drop the columns from df_test if they are present
df_test.drop(columns=[col for col in var_to_drop if col in df_test.columns], inplace=True)
# Drop the columns from df_all if they are present
df_all.drop(columns=[col for col in var_to_drop if col in df_all.columns], inplace=True)
df_train.shape
(452394, 68)
df_test.shape
(61503, 68)
df_all.shape
(513897, 69)
As can be seen, there were only 4 such columns.
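The "drop only what is still there" step can also be written with the `errors="ignore"` option of `DataFrame.drop`. The sketch below is a minimal illustration on a toy frame; the column names `a`, `b`, `c`, `x`, `y` are made up and do not come from the dataset.

```python
import pandas as pd

# Toy frame; the column names are illustrative only.
df = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})
var_to_drop = ["b", "x", "y"]  # "x" and "y" are not present in df

# Columns that will actually be removed (order of var_to_drop preserved).
actually_dropped = [col for col in var_to_drop if col in df.columns]

# errors="ignore" skips labels that are missing instead of raising KeyError.
df = df.drop(columns=var_to_drop, errors="ignore")

print(actually_dropped)
print(list(df.columns))
```

Keeping the `actually_dropped` list makes it possible to report how many columns were really removed instead of inferring it from the change in shape.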
# Additionally drop FLAG_DOCUMENT_12: for some reason it was not included in the predictor ranking, but its IV is very low
var_to_drop = ['FLAG_DOCUMENT_12']
df_train.drop(columns=var_to_drop, inplace=True)  # drop this single column
df_test.drop(columns=var_to_drop, inplace=True)
df_all.drop(columns=var_to_drop, inplace=True)
df_train.shape
(452394, 67)
df_test.shape
(61503, 67)
df_all.shape
(513897, 68)
Reviewing the results of the initial variable reduction¶
# Check which variables remain in the dataset
results = []
total_rows = len(df_train)  # Number of rows in the table
# Iterate over each column
for column in df_train:
    unique_values = df_train[column].nunique()  # Number of unique values in the column
    data_type = df_train[column].dtype  # Data type
    null_count = df_train[column].isnull().sum()  # Number of nulls
    null_percent = (null_count / total_rows) * 100  # % of nulls
    # Conditions for displaying the unique values
    # If the variable has at most 5 of them, show them regardless of data type
    if unique_values <= 5:
        values_to_display = df_train[column].unique()
    # If the column is non-numeric, show all values even when there are more than 5
    elif df_train[column].dtype == 'object':
        values_to_display = df_train[column].unique()
    # For numeric types with more than 5 unique values, show the message below
    else:
        values_to_display = "> 5 unique numeric values"
    # Append the results to the list
    results.append({
        'Variable': column,
        'Data Type': data_type,
        'Unique Values Count': unique_values,
        'Unique Values': values_to_display,
        'Null Values Count': null_count,
        'Null Values %': null_percent
    })
# Convert the results to a DataFrame
results_df = pd.DataFrame(results)
predictors = pd.merge(ranking_predyktorow, results_df, how='inner', left_on='Zmienna', right_on='Variable')
predictors
predictors[['Zmienna', 'Data Type', 'IV', 'V Cramera', 'Gini']]
| | Zmienna | Data Type | IV | V Cramera | Gini |
|---|---|---|---|---|---|
| 0 | EXT_SOURCE_3 | float64 | 0.33 | 0.28 | 0.25 |
| 1 | EXT_SOURCE_2 | float64 | 0.32 | 0.27 | 0.31 |
| 2 | BUREAU_DAYS_CREDIT_MEAN | float64 | 0.12 | 0.17 | 0.10 |
| 3 | DAYS_BIRTH | int64 | 0.09 | 0.15 | 0.17 |
| 4 | AMT_GOODS_PRICE | float64 | 0.09 | 0.15 | 0.07 |
| 5 | OCCUPATION_TYPE | object | 0.09 | 0.15 | 0.07 |
| 6 | DAYS_EMPLOYED | int64 | 0.08 | 0.14 | 0.07 |
| 7 | BUREAU_DAYS_CREDIT_MAX | float64 | 0.08 | 0.14 | 0.07 |
| 8 | BUREAU_DAYS_CREDIT_MIN | float64 | 0.08 | 0.14 | 0.07 |
| 9 | ORGANIZATION_TYPE | object | 0.08 | 0.14 | 0.05 |
| 10 | BUREAU_MOST_FREQ_CREDIT_ACTIVE | object | 0.07 | 0.13 | 0.13 |
| 11 | BUREAU_DAYS_CREDIT_ENDDATE_MEAN | float64 | 0.07 | 0.13 | 0.06 |
| 12 | INSTALLMENTS_PAYMENTS_IS_DELAYED_MEAN | float64 | 0.06 | 0.12 | 0.13 |
| 13 | PREVIOUS_APPLICATION_NAME_CONTRACT_STATUS_REFUSED | float64 | 0.06 | 0.12 | 0.12 |
| 14 | PREVIOUS_APPLICATION_NAME_PRODUCT_TYPE_WALK-IN | float64 | 0.06 | 0.12 | 0.11 |
| 15 | NAME_INCOME_TYPE | object | 0.06 | 0.12 | 0.10 |
| 16 | BUREAU_DAYS_CREDIT_UPDATE_MIN | float64 | 0.06 | 0.12 | 0.06 |
| 17 | BUREAU_DAYS_CREDIT_ENDDATE_MIN | float64 | 0.06 | 0.12 | 0.05 |
| 18 | CREDIT_CARD_BALANCE_CNT_DRAWINGS_CURRENT_MEAN | float64 | 0.06 | 0.12 | 0.04 |
| 19 | CREDIT_CARD_BALANCE_CREDIT_USE_RATE_MEAN | float64 | 0.06 | 0.12 | 0.04 |
| 20 | CREDIT_CARD_BALANCE_CNT_INSTALMENT_MATURE_CUM_SUM | float64 | 0.06 | 0.12 | 0.02 |
| 21 | DAYS_LAST_PHONE_CHANGE | int64 | 0.05 | 0.11 | 0.12 |
| 22 | PREVIOUS_APPLICATION_DAYS_DECISION_MEAN | float64 | 0.05 | 0.11 | 0.12 |
| 23 | INSTALLMENTS_PAYMENTS_PAYMENT_UNDERPAYMENT_MEAN | float64 | 0.05 | 0.11 | 0.11 |
| 24 | REGION_RATING_CLIENT_W_CITY | int64 | 0.05 | 0.11 | 0.10 |
| 25 | REGION_RATING_CLIENT | int64 | 0.05 | 0.11 | 0.09 |
| 26 | NAME_EDUCATION_TYPE | object | 0.05 | 0.11 | 0.08 |
| 27 | PREVIOUS_APPLICATION_MOST_FREQ_PREVIOUS_APPLICATION_PRODUCT_COMBINATION | object | 0.05 | 0.11 | 0.02 |
| 28 | PREVIOUS_APPLICATION_DAYS_DECISION_MIN | float64 | 0.04 | 0.10 | 0.12 |
| 29 | POS_CASH_MONTHS_BALANCE_MIN | float64 | 0.04 | 0.10 | 0.12 |
| 30 | DAYS_ID_PUBLISH | int64 | 0.04 | 0.10 | 0.11 |
| 31 | INSTALLMENTS_PAYMENTS_IS_DELAYED_SUM | float64 | 0.04 | 0.10 | 0.10 |
| 32 | INSTALLMENTS_PAYMENTS_PAYMENT_UNDERPAYMENT_SUM | float64 | 0.04 | 0.10 | 0.10 |
| 33 | INSTALLMENTS_PAYMENTS_PERCENTAGE_DELAYED | float64 | 0.04 | 0.10 | 0.10 |
| 34 | CODE_GENDER | object | 0.04 | 0.10 | 0.10 |
| 35 | PREVIOUS_APPLICATION_NAME_CLIENT_TYPE_NEW | float64 | 0.04 | 0.10 | 0.09 |
| 36 | PREVIOUS_APPLICATION_RATIO_APP_TO_GET_MAX | float64 | 0.04 | 0.10 | 0.08 |
| 37 | PREVIOUS_APPLICATION_CODE_REJECT_REASON_HC | float64 | 0.04 | 0.09 | 0.09 |
| 38 | POS_CASH_MONTHS_BALANCE_MEAN | float64 | 0.03 | 0.09 | 0.09 |
| 39 | INSTALLMENTS_PAYMENTS_PAYMENT_UNDERPAYMENT_MAX | float64 | 0.03 | 0.09 | 0.09 |
| 40 | REG_CITY_NOT_WORK_CITY | int64 | 0.03 | 0.09 | 0.08 |
| 41 | PREVIOUS_APPLICATION_NAME_CONTRACT_TYPE_REVOLVING LOANS | float64 | 0.03 | 0.09 | 0.08 |
| 42 | FLAG_EMP_PHONE | int64 | 0.03 | 0.09 | 0.07 |
| 43 | POS_CASH_APP_COUNT | float64 | 0.03 | 0.09 | 0.07 |
| 44 | PREVIOUS_APPLICATION_AMT_ANNUITY_MIN | float64 | 0.03 | 0.09 | 0.06 |
| 45 | PREVIOUS_APPLICATION_RATIO_APP_TO_GET_MEAN | float64 | 0.03 | 0.09 | 0.06 |
| 46 | PREVIOUS_APPLICATION_CODE_REJECT_REASON_LIMIT | float64 | 0.03 | 0.09 | 0.06 |
| 47 | PREVIOUS_APPLICATION_MOST_FREQ_PREVIOUS_APPLICATION_NAME_GOODS_CATEGORY | object | 0.03 | 0.09 | 0.06 |
| 48 | PREVIOUS_APPLICATION_NAME_YIELD_GROUP_HIGH | float64 | 0.03 | 0.08 | 0.09 |
| 49 | DAYS_REGISTRATION | float64 | 0.03 | 0.08 | 0.08 |
| 50 | PREVIOUS_APPLICATION_NAME_YIELD_GROUP_XNA | float64 | 0.03 | 0.08 | 0.08 |
| 51 | FLAG_DOCUMENT_3 | int64 | 0.03 | 0.08 | 0.07 |
| 52 | POS_CASH_NAME_CONTRACT_STATUS_ACTIVE | float64 | 0.03 | 0.08 | 0.07 |
| 53 | REGION_POPULATION_RELATIVE | float64 | 0.03 | 0.08 | 0.06 |
| 54 | POS_CASH_SK_DPD_DEF_MEAN | float64 | 0.03 | 0.08 | 0.06 |
| 55 | BUREAU_MOST_FREQ_CREDIT_TYPE | object | 0.03 | 0.08 | 0.05 |
| 56 | BUREAU_AMT_CREDIT_SUM_MEAN | float64 | 0.02 | 0.08 | 0.07 |
| 57 | POS_CASH_SK_DPD_DEF_MAX | float64 | 0.02 | 0.08 | 0.06 |
| 58 | BUREAU_AMT_CREDIT_SUM_SUM | float64 | 0.02 | 0.07 | 0.07 |
| 59 | POS_CASH_SK_DPD_MAX | float64 | 0.02 | 0.07 | 0.06 |
| 60 | POS_CASH_SK_DPD_MEAN | float64 | 0.02 | 0.07 | 0.06 |
| 61 | PREVIOUS_APPLICATION_MOST_FREQ_PREVIOUS_APPLICATION_NAME_PAYMENT_TYPE | object | 0.02 | 0.07 | 0.06 |
| 62 | NAME_FAMILY_STATUS | object | 0.02 | 0.07 | 0.04 |
| 63 | PREVIOUS_APPLICATION_MOST_FREQ_PREVIOUS_APPLICATION_CHANNEL_TYPE | object | 0.02 | 0.07 | 0.00 |
| 64 | PREVIOUS_APPLICATION_DAYS_DECISION_SUM | float64 | 0.02 | 0.06 | 0.07 |
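The per-column summary built in the loop above (data type, number of unique values, null count and null share) can also be assembled without an explicit loop. This is only a sketch on toy data: the frame `toy` and its column names are invented for illustration and stand in for `df_train`.

```python
import numpy as np
import pandas as pd

# Toy data standing in for df_train; the column names are made up.
toy = pd.DataFrame({
    "num": [1.0, 2.0, np.nan, 4.0],
    "flag": [0, 1, 0, 1],
    "cat": ["a", "b", "a", None],
})

# Each entry is a Series indexed by column name, so they align automatically.
summary = pd.DataFrame({
    "Data Type": toy.dtypes,
    "Unique Values Count": toy.nunique(),        # missing values are not counted
    "Null Values Count": toy.isnull().sum(),
    "Null Values %": toy.isnull().mean() * 100,  # mean of booleans = share of nulls
}).rename_axis("Variable").reset_index()

print(summary)
```

The result has one row per column and can be merged with a ranking table on `Variable` in the same way as `results_df`.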
#df_train.to_csv('df_train2.csv', index=False)
Selecting representatives - Statistica¶
Predictor selection completed successfully.
Now, with the set narrowed down to 65 explanatory variables, I use the representative-selection module, which identifies redundant (in the sense of correlation) variables without the need to build and analyse a correlation matrix for all variables. The module creates bundles of mutually correlated variables based on factor analysis (factors are extracted with the principal components method) with factor rotation, carried out with the standard STATISTICA procedure. The bundles are formed from the factor loadings (the correlations between a given variable and a particular factor). The user can set the minimum absolute loading at which a variable is treated as a potential representative of a given factor. The number of components is determined by the options Number of factors (Liczba czynników) and Eigenvalue (Wartość własna).
Qualitative predictors are of course included in the analysis, so before the factor analysis these variables are recoded with the WoE transformation (based on the log odds of the modelled event). This in turn requires specifying a binary dependent variable.
After the analysis, variables in the same bundle correlate strongly (above the value set in the Min. loading option, Min. ładunek) with the factor assigned to that bundle and, importantly, are very often strongly correlated with one another as well. From these variables, the ones kept for further analysis are chosen on the basis of their mutual correlation, the strength of their association with the dependent variable, and expert knowledge. Representatives can also be selected from the bundles automatically. With the correlation criterion, the variables chosen are those with the highest average correlation with the other variables in the bundle. With the IV criterion, the representatives chosen are those most strongly associated with the dependent variable.
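The idea of forming bundles from factor loadings can be sketched in Python. This is not the STATISTICA procedure: it extracts principal components from the correlation matrix but applies no factor rotation, the data are synthetic, and the eigenvalue cut-off of 1.0 is an assumption made only for this illustration. The minimum loading of 0.7 is the value used in this analysis.

```python
import numpy as np
import pandas as pd

# Synthetic data: x1..x3 share one latent driver, x4..x5 share another.
rng = np.random.default_rng(0)
n = 500
base1, base2 = rng.normal(size=n), rng.normal(size=n)
toy = pd.DataFrame({
    "x1": base1 + 0.3 * rng.normal(size=n),
    "x2": base1 + 0.3 * rng.normal(size=n),
    "x3": base1 + 0.3 * rng.normal(size=n),
    "x4": base2 + 0.3 * rng.normal(size=n),
    "x5": base2 + 0.3 * rng.normal(size=n),
})

# Principal components of the correlation matrix (eigh returns ascending order).
corr = toy.corr().to_numpy()
eigvals, eigvecs = np.linalg.eigh(corr)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Keep components whose eigenvalue is at least 1.0 (assumed cut-off).
keep = eigvals >= 1.0
# Loading = correlation between a variable and a component (unrotated).
loadings = pd.DataFrame(eigvecs[:, keep] * np.sqrt(eigvals[keep]), index=toy.columns)

# A variable joins a bundle when its absolute loading reaches the minimum.
min_loading = 0.7
bundles = {
    factor: list(loadings.index[loadings[factor].abs() >= min_loading])
    for factor in loadings.columns
}
print(bundles)
```

Without rotation, factors of similar strength can mix and blur the bundles, which is one reason the module rotates the factors before assigning variables.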
This is what the feature looks like in STATISTICA:
Min. loading (Min. ładunek) sets the criterion for assigning individual variables to the bundles represented by the extracted factors. The value is the minimum absolute loading (the correlation between a variable and a factor) that qualifies a variable for a bundle. I set it to 0.7.
Number of factors (Liczba czynników) sets how many factors will be extracted. This option is linked to the Eigenvalue (Wartość własna) option. The program extracts the number of factors that follows from both criteria jointly: the number of factors will not exceed the Number of factors parameter, and the eigenvalue of each extracted factor will not be lower than the Eigenvalue parameter.
Calculations for log odds (Obliczenia dla logarytmu szans) transforms the predictors into WoE values. Thanks to this transformation, qualitative variables can be used in the analysis. Quantitative variables are split into equal-frequency categories before the transformation.
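The WoE transformation of a quantitative variable, preceded by a split into equal-frequency categories, can be sketched as follows. This is an illustrative sketch and not STATISTICA's implementation: the function name `woe_iv`, the toy data and the sign convention (log of the share of non-events over the share of events) are assumptions.

```python
import numpy as np
import pandas as pd

def woe_iv(feature: pd.Series, target: pd.Series, bins: int = 4):
    """Return (per-bin WoE table, IV) for a numeric feature and a 0/1 target."""
    # Equal-frequency bins; rows with a missing feature value are dropped by crosstab.
    binned = pd.qcut(feature, q=bins, duplicates="drop")
    counts = pd.crosstab(binned, target)
    good = counts[0] / counts[0].sum()   # share of non-events falling in each bin
    bad = counts[1] / counts[1].sum()    # share of events falling in each bin
    # A bin with zero events or zero non-events gives an infinite WoE;
    # real implementations usually smooth or merge such bins.
    woe = np.log(good / bad)
    iv = float(((good - bad) * woe).sum())
    table = pd.DataFrame({"good_share": good, "bad_share": bad, "WoE": woe})
    return table, iv

# Toy data: higher values of x go with more events (target = 1).
x = pd.Series([1, 2, 3, 4, 5, 6, 7, 8])
y = pd.Series([0, 0, 0, 1, 0, 1, 1, 1])
table, iv = woe_iv(x, y, bins=2)
print(table)
print(round(iv, 4))
```

With two bins, each bin holds three observations of one class and one of the other, so the WoE values are ln 3 and -ln 3 and the IV equals ln 3.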
As with predictor selection, the variables have to be chosen:
After the computation we obtain the following table:
The loaded dataset with the measures is shown below.
reprezentanci = pd.read_excel('wybor reprezentantow.xlsx', engine='openpyxl')
reprezentanci
| | Czynnik | Zmienna | Ładunek | IV |
|---|---|---|---|---|
| 0 | 1 | PREVIOUS_APPLICATION_DAYS_DECISION_MIN | 0.92 | 0.04 |
| 1 | 1 | POS_CASH_MONTHS_BALANCE_MIN | 0.89 | 0.04 |
| 2 | 1 | POS_CASH_MONTHS_BALANCE_MEAN | 0.84 | 0.03 |
| 3 | 1 | PREVIOUS_APPLICATION_DAYS_DECISION_MEAN | 0.82 | 0.04 |
| 4 | 2 | PREVIOUS_APPLICATION_NAME_CONTRACT_STATUS_REFUSED | 0.84 | 0.03 |
| 5 | 2 | PREVIOUS_APPLICATION_NAME_PRODUCT_TYPE_WALK-IN | 0.72 | 0.02 |
| 6 | 3 | FLAG_EMP_PHONE | 0.89 | 0.03 |
| 7 | 3 | DAYS_EMPLOYED | 0.89 | 0.03 |
| 8 | 4 | REGION_RATING_CLIENT | 0.95 | 0.05 |
| 9 | 4 | REGION_RATING_CLIENT_W_CITY | 0.95 | 0.05 |
| 10 | 4 | REGION_POPULATION_RELATIVE | 0.81 | 0.03 |
| 11 | 5 | POS_CASH_SK_DPD_DEF_MAX | 0.92 | 0.00 |
| 12 | 5 | POS_CASH_SK_DPD_MAX | 0.85 | 0.00 |
| 13 | 5 | POS_CASH_SK_DPD_DEF_MEAN | 0.79 | 0.00 |
| 14 | 6 | BUREAU_DAYS_CREDIT_MIN | 0.80 | 0.08 |
| 15 | 6 | BUREAU_DAYS_CREDIT_MEAN | 0.77 | 0.13 |
| 16 | 7 | PREVIOUS_APPLICATION_RATIO_APP_TO_GET_MEAN | -0.70 | 0.01 |
| 17 | 8 | INSTALLMENTS_PAYMENTS_IS_DELAYED_SUM | -0.82 | 0.01 |
| 18 | 8 | INSTALLMENTS_PAYMENTS_IS_DELAYED_MEAN | -0.82 | 0.06 |
| 19 | 10 | BUREAU_DAYS_CREDIT_ENDDATE_MEAN | 0.77 | 0.02 |
| 20 | 10 | BUREAU_DAYS_CREDIT_ENDDATE_MIN | 0.76 | 0.00 |
| 21 | 11 | CODE_GENDER | 0.71 | 0.04 |
| 22 | 14 | BUREAU_DAYS_CREDIT_MAX | 0.71 | 0.07 |
| 23 | 15 | INSTALLMENTS_PAYMENTS_PAYMENT_UNDERPAYMENT_MEAN | 0.75 | 0.00 |
| 24 | 15 | INSTALLMENTS_PAYMENTS_PAYMENT_UNDERPAYMENT_SUM | 0.71 | 0.01 |
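Choosing one representative per factor by the IV criterion can be sketched on a few rows copied from the table above (factors 1 and 6). The tie-breaking rule, which prefers the larger absolute loading when IV values are equal, is an assumption of this sketch. The IV values shown are rounded to two decimals, so ties seen here may not be ties in the underlying file.

```python
import pandas as pd

# Rows copied from the table above for factors 1 and 6.
reprezentanci_toy = pd.DataFrame({
    "Czynnik": [1, 1, 1, 1, 6, 6],
    "Zmienna": [
        "PREVIOUS_APPLICATION_DAYS_DECISION_MIN",
        "POS_CASH_MONTHS_BALANCE_MIN",
        "POS_CASH_MONTHS_BALANCE_MEAN",
        "PREVIOUS_APPLICATION_DAYS_DECISION_MEAN",
        "BUREAU_DAYS_CREDIT_MIN",
        "BUREAU_DAYS_CREDIT_MEAN",
    ],
    "Ładunek": [0.92, 0.89, 0.84, 0.82, 0.80, 0.77],
    "IV": [0.04, 0.04, 0.03, 0.04, 0.08, 0.13],
})

# Within each factor: highest IV first, ties broken by larger |loading| (assumed rule).
ranked = (
    reprezentanci_toy
    .assign(abs_loading=reprezentanci_toy["Ładunek"].abs())
    .sort_values(["Czynnik", "IV", "abs_loading"], ascending=[True, False, False])
)
# The first row of each factor is its representative.
best = ranked.drop_duplicates("Czynnik")[["Czynnik", "Zmienna", "IV"]]
print(best)
```

An automatic pick like this is only a starting point; the final choice in the analysis also weighs mutual correlation and expert knowledge.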
At this point I would additionally like to check which variables are most strongly correlated with each other.
Removing variables that are highly correlated with each other (as opposed to with the target) matters because correlated predictors cause multicollinearity, which weakens the interpretability of the model and can contribute to overfitting. Reducing such variables improves the stability and efficiency of the model, which is key to obtaining reliable predictions that generalize. I therefore remove variables from pairs whose correlation exceeds 0.9 in absolute value, taking the business context into account. From each pair I drop the variable that has more missing values, more outliers, less business relevance, or is harder to interpret.
# Compute correlations between the numeric columns
numeric_cols = df_train.select_dtypes(include=[np.number])
corr_matrix = numeric_cols.corr()
# Highest correlations between variables; this helps identify columns to drop
high_corr_threshold = 0.9
# Find pairs of highly correlated variables and collect them for presentation
high_corr_pairs = []
for i in corr_matrix.columns:
    for j in corr_matrix.index[corr_matrix.index > i]:  # only names that sort after i: skips duplicates and self-pairs
        if abs(corr_matrix.at[i, j]) > high_corr_threshold:
            high_corr_pairs.append([i, j, corr_matrix.at[i, j]])
# Convert the list to a DataFrame for nicer presentation
high_corr_df = pd.DataFrame(high_corr_pairs, columns=['Zmienna 1', 'Zmienna 2', 'Correlation'])
high_corr_df
| | Zmienna 1 | Zmienna 2 | Correlation |
|---|---|---|---|
| 0 | DAYS_EMPLOYED | FLAG_EMP_PHONE | -1.00 |
| 1 | REGION_RATING_CLIENT | REGION_RATING_CLIENT_W_CITY | 0.95 |
| 2 | POS_CASH_MONTHS_BALANCE_MIN | PREVIOUS_APPLICATION_DAYS_DECISION_MIN | 0.93 |
| 3 | POS_CASH_SK_DPD_MAX | POS_CASH_SK_DPD_MEAN | 0.92 |
| 4 | POS_CASH_SK_DPD_DEF_MAX | POS_CASH_SK_DPD_DEF_MEAN | 0.98 |
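The same list of pairs can be obtained without a nested loop. The sketch below is an alternative, not the method used in this notebook: it keeps only the part of the correlation matrix strictly above the diagonal, so each pair appears once and no variable is compared with itself. The helper `find_high_corr_pairs` and the toy frame are hypothetical.

```python
import numpy as np
import pandas as pd


def find_high_corr_pairs(df: pd.DataFrame, threshold: float = 0.9) -> pd.DataFrame:
    """Return variable pairs whose absolute correlation exceeds the threshold."""
    corr = df.corr()
    # True strictly above the diagonal: every pair once, no self-correlations
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack().reset_index()
    pairs.columns = ["Zmienna 1", "Zmienna 2", "Correlation"]
    # Rows left as NaN by the mask (if any) fail this comparison and are dropped
    return pairs[pairs["Correlation"].abs() > threshold].reset_index(drop=True)


# Toy data (assumption): "b" is almost a multiple of "a", "c" is unrelated
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 4.0, 6.0, 8.1],
    "c": [4.0, 1.0, 3.0, 2.0],
})
result = find_high_corr_pairs(toy)
print(result)
```

Unlike the loop, this version orders pairs by column position rather than alphabetically, so the two variables of a pair may appear in swapped order.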
columns_to_drop = [
    'POS_CASH_MONTHS_BALANCE_MIN',  # factor 1, also strongly correlated with PREVIOUS_APPLICATION_DAYS_DECISION_MIN
    'POS_CASH_MONTHS_BALANCE_MEAN',  # factor 1
    'PREVIOUS_APPLICATION_DAYS_DECISION_MEAN',  # factor 1
    'PREVIOUS_APPLICATION_NAME_PRODUCT_TYPE_WALK-IN',  # factor 2
    'FLAG_EMP_PHONE',  # factor 3, very strongly correlated with DAYS_EMPLOYED
    'REGION_RATING_CLIENT_W_CITY',  # factor 4, correlation with REGION_RATING_CLIENT too high
    'REGION_POPULATION_RELATIVE',  # factor 4
    'POS_CASH_SK_DPD_DEF_MAX',  # factor 5, IV = 0
    'POS_CASH_SK_DPD_MAX',  # factor 5, IV = 0
    'POS_CASH_SK_DPD_DEF_MEAN',  # factor 5, IV = 0
    'BUREAU_DAYS_CREDIT_MIN',  # factor 6
    'INSTALLMENTS_PAYMENTS_IS_DELAYED_SUM',  # factor 8
    'BUREAU_DAYS_CREDIT_ENDDATE_MIN',  # factor 10
    'INSTALLMENTS_PAYMENTS_PAYMENT_UNDERPAYMENT_MEAN',  # factor 15, very poor IV
    'INSTALLMENTS_PAYMENTS_PAYMENT_UNDERPAYMENT_SUM'  # factor 15
]
df_train.drop(columns=columns_to_drop, inplace=True)
df_test.drop(columns=columns_to_drop, inplace=True)
df_all.drop(columns=columns_to_drop, inplace=True)
len(columns_to_drop)
15
# Drop the SK_ID_CURR column
columns_to_drop = ['SK_ID_CURR']
df_train.drop(columns=columns_to_drop, inplace=True)
df_test.drop(columns=columns_to_drop, inplace=True)
df_all.drop(columns=columns_to_drop, inplace=True)
df_train.shape  # 51 columns remain, 1 of which is the target, leaving 50 explanatory variables
(452394, 51)
df_test.shape
(61503, 51)
df_all.shape
(513897, 52)
Variable analysis¶
Distribution of quantitative variables¶
# Get the numeric columns
numeric_columns = df_train.select_dtypes(include=[np.number]).columns
# Number of columns in the plot grid
num_cols = 2
num_rows = (len(numeric_columns) + 1) // num_cols
fig, axs = plt.subplots(num_rows, num_cols, figsize=(15, 5 * num_rows))
# Loop over all numeric columns in the DataFrame
for i, column in enumerate(numeric_columns):
    row = i // num_cols
    col = i % num_cols
    sns.histplot(data=df_train, x=column, bins=50, hue=df_train['TARGET'], ax=axs[row, col], multiple="stack")
    axs[row, col].set_title(f'Distribution of variable - {column}')
    axs[row, col].set_ylabel('Count')
# If the number of plotted columns is odd, hide the last, empty subplot
if len(numeric_columns) % num_cols != 0:
    axs[-1, -1].axis('off')
plt.tight_layout()
plt.show()
* AMT_GOODS_PRICE -> The more expensive the goods, the fewer bad clients there are, at least in theory.
* DAYS_BIRTH -> The younger the clients, the proportionally larger the share of TARGET = 1, i.e. bad clients.
* DAYS_EMPLOYED -> This variable should contain only negative values, yet a very large share of observations have absurdly high values. In theory they could be labelled e.g. 'Unknown', but that makes little sense, because this information must be provided on the application form from which the variable comes. I decide to remove this variable from the dataset.
* EXT_SOURCE_2, EXT_SOURCE_3 -> It is very clear that the lower an applicant's rating from the external source, the higher the probability that they turn out to be a bad client. Both variables appear to separate bad and good clients very well.
* DAYS_LAST_PHONE_CHANGE -> People who changed their phone recently are bad clients more often. This makes sense from a business point of view.
* BUREAU_DAYS_CREDIT_MEAN and BUREAU_DAYS_CREDIT_MAX -> These variables say how many days before the current application the client applied for a credit reported in the credit bureau. The more recent that earlier application, the worse the clients, at least in theory.
* PREVIOUS_APPLICATION_DAYS_DECISION_MIN and ...SUM -> These variables say how long ago the decision on a previous application was made. It turns out that the more recent the decision, the worse the clients.
For the remaining variables it is hard to find clear relationships showing what characterizes bad clients as opposed to good ones.
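Observations like the ones above are read off stacked histograms, but they can also be put into numbers. A minimal sketch, assuming equal-frequency bins and a 0/1 target: the share of bad clients is computed per bin of a single variable. The helper `bad_rate_by_bin` and the synthetic `DAYS_BIRTH` / `TARGET` values are hypothetical and only mimic the pattern described for DAYS_BIRTH.

```python
import numpy as np
import pandas as pd


def bad_rate_by_bin(values: pd.Series, target: pd.Series, n_bins: int = 5) -> pd.DataFrame:
    """Number of observations and share of bad clients per equal-frequency bin."""
    bins = pd.qcut(values, q=n_bins, duplicates="drop")
    summary = target.groupby(bins, observed=True).agg(["count", "mean"])
    return summary.rename(columns={"count": "n_obs", "mean": "bad_rate"})


# Synthetic data (assumption): values closer to 0 mean younger clients,
# and the younger half is given a 30% bad rate versus 10% for the older half
i = np.arange(1000)
days_birth = pd.Series(-7000 - 18 * i, name="DAYS_BIRTH")
target = pd.Series(np.where(i < 500, i % 10 < 3, i % 10 < 1).astype(int), name="TARGET")

report = bad_rate_by_bin(days_birth, target)
print(report)
```

A bad rate that rises or falls steadily across the bins supports the visual impression; a flat or erratic one suggests the variable separates the two groups poorly.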
# Drop the DAYS_EMPLOYED variable
columns_to_drop = ['DAYS_EMPLOYED']
df_train = df_train.drop(columns=columns_to_drop)
df_test = df_test.drop(columns=columns_to_drop)
df_all = df_all.drop(columns=columns_to_drop)
df_train.shape
(452394, 50)
df_test.shape
(61503, 50)
df_all.shape
(513897, 51)
Correlations between quantitative variables¶
import plotly.graph_objects as go
numeric_df = df_train.select_dtypes(include=[np.number])
correlation_matrix = numeric_df.corr()
labels = correlation_matrix.columns.tolist()
text = correlation_matrix.round(2).astype(str).values
# Build the correlation heatmap
fig = go.Figure(data=go.Heatmap(
    z=correlation_matrix,
    x=labels,
    y=labels,
    text=text,
    hoverinfo="text",
    colorscale='RdBu'))
# Add a title and set the figure size
fig.update_layout(
    title='Correlation matrix',
    autosize=False,
    width=1200,
    height=1500,
    margin=dict(t=50, b=50, l=50, r=50)
)
fig.show()
Highest correlations between variables¶
high_corr_threshold = 0.5
# Find pairs of highly correlated variables and collect them for presentation
high_corr_pairs = []
for i in correlation_matrix.columns:
    for j in correlation_matrix.index[correlation_matrix.index > i]:  # only names that sort after i: skips duplicates and self-pairs
        if abs(correlation_matrix.at[i, j]) > high_corr_threshold:
            high_corr_pairs.append([i, j, correlation_matrix.at[i, j]])
# Convert the list to a DataFrame for nicer presentation
high_corr_df = pd.DataFrame(high_corr_pairs, columns=['Zmienna 1', 'Zmienna 2', 'Correlation'])
# Sort by the absolute value of the correlation
high_corr_df['Abs Correlation'] = high_corr_df['Correlation'].abs()
high_corr_df = high_corr_df.sort_values(by='Abs Correlation', ascending=False).drop('Abs Correlation', axis=1)
high_corr_df
| | Zmienna 1 | Zmienna 2 | Correlation |
|---|---|---|---|
| 7 | PREVIOUS_APPLICATION_CODE_REJECT_REASON_HC | PREVIOUS_APPLICATION_NAME_CONTRACT_STATUS_REFUSED | 0.85 |
| 16 | POS_CASH_APP_COUNT | POS_CASH_NAME_CONTRACT_STATUS_ACTIVE | 0.84 |
| 14 | POS_CASH_APP_COUNT | PREVIOUS_APPLICATION_DAYS_DECISION_SUM | -0.78 |
| 12 | POS_CASH_NAME_CONTRACT_STATUS_ACTIVE | PREVIOUS_APPLICATION_DAYS_DECISION_SUM | -0.72 |
| 2 | PREVIOUS_APPLICATION_DAYS_DECISION_MIN | PREVIOUS_APPLICATION_DAYS_DECISION_SUM | 0.68 |
| 0 | BUREAU_DAYS_CREDIT_MAX | BUREAU_DAYS_CREDIT_MEAN | 0.67 |
| 1 | BUREAU_AMT_CREDIT_SUM_MEAN | BUREAU_AMT_CREDIT_SUM_SUM | 0.66 |
| 5 | PREVIOUS_APPLICATION_NAME_CONTRACT_TYPE_REVOLVING LOANS | PREVIOUS_APPLICATION_NAME_YIELD_GROUP_XNA | 0.65 |
| 4 | PREVIOUS_APPLICATION_DAYS_DECISION_SUM | PREVIOUS_APPLICATION_NAME_YIELD_GROUP_HIGH | -0.64 |
| 6 | PREVIOUS_APPLICATION_NAME_CONTRACT_STATUS_REFUSED | PREVIOUS_APPLICATION_NAME_YIELD_GROUP_XNA | 0.59 |
| 10 | PREVIOUS_APPLICATION_NAME_YIELD_GROUP_XNA | PREVIOUS_APPLICATION_RATIO_APP_TO_GET_MEAN | -0.59 |
| 8 | PREVIOUS_APPLICATION_CODE_REJECT_REASON_HC | PREVIOUS_APPLICATION_NAME_YIELD_GROUP_XNA | 0.57 |
| 15 | POS_CASH_APP_COUNT | PREVIOUS_APPLICATION_NAME_YIELD_GROUP_HIGH | 0.55 |
| 13 | POS_CASH_APP_COUNT | PREVIOUS_APPLICATION_DAYS_DECISION_MIN | -0.52 |
| 3 | PREVIOUS_APPLICATION_DAYS_DECISION_SUM | PREVIOUS_APPLICATION_NAME_CONTRACT_STATUS_REFUSED | -0.51 |
| 11 | POS_CASH_NAME_CONTRACT_STATUS_ACTIVE | PREVIOUS_APPLICATION_DAYS_DECISION_MIN | -0.51 |
| 9 | PREVIOUS_APPLICATION_CODE_REJECT_REASON_LIMIT | PREVIOUS_APPLICATION_NAME_CONTRACT_STATUS_REFUSED | 0.50 |
# Drop one variable from each of the two most strongly correlated pairs
columns_to_drop = ['PREVIOUS_APPLICATION_CODE_REJECT_REASON_HC', 'POS_CASH_NAME_CONTRACT_STATUS_ACTIVE']  # ...HC had a worse IV than ...REFUSED; ...ACTIVE was slightly worse than POS_CASH_APP_COUNT
df_train = df_train.drop(columns=columns_to_drop)
df_test = df_test.drop(columns=columns_to_drop)
df_all = df_all.drop(columns=columns_to_drop)
df_train.shape
(452394, 48)
df_test.shape
(61503, 48)
df_all.shape
(513897, 49)
Distribution of qualitative variables¶
# Get the categorical columns
categorical_columns = df_train.select_dtypes(exclude=[np.number]).columns
df_train['TARGET'] = df_train['TARGET'].astype(str)
# Number of columns in the plot grid
num_cols = 2
num_rows = (len(categorical_columns) + 1) // num_cols
fig, axs = plt.subplots(num_rows, num_cols, figsize=(15, 5 * num_rows))
# Loop over all categorical columns in the DataFrame
for i, column in enumerate(categorical_columns):
    row = i // num_cols
    col = i % num_cols
    sns.countplot(data=df_train, x=column, hue=df_train['TARGET'], ax=axs[row, col])
    axs[row, col].set_title(f'Distribution of variable - {column}')
    axs[row, col].set_ylabel('Count')
    axs[row, col].tick_params(axis='x', rotation=45)  # Rotate the labels in case they are too long
# If the number of plotted columns is odd, hide the last, empty subplot
if len(categorical_columns) % num_cols != 0:
    axs[-1, -1].axis('off')
plt.tight_layout()
plt.show()
# Drop CODE_GENDER, because gender data should formally not be used in the model
columns_to_drop = ['CODE_GENDER']
df_train = df_train.drop(columns=columns_to_drop)
df_test = df_test.drop(columns=columns_to_drop)
df_all = df_all.drop(columns=columns_to_drop)
df_train.shape
(452394, 47)
df_test.shape
(61503, 47)
df_all.shape
(513897, 48)
Selecting the best variables for the models¶
After the earlier analyses, 46 explanatory variables remain. That is still too many, so I will now try to select only about 20 of the best variables to be used in the final scoring models.
Random Forest Importance¶
Feature importance is a technique for assessing which variables in a dataset have the greatest influence on the predicted outcome. Tree-based algorithms such as Random Forest or Gradient Boosting Machines (GBM) are particularly useful here, because they assess the importance of a feature while the trees are being built, by analyzing how individual features affect node purity.
Random forests are among the most popular machine learning algorithms. They are effective because they generally offer good predictive performance, a low tendency to overfit, and an easy reading of variable importance. In other words, it is easy to compute how much each variable contributes to the decision.
A random forest consists of tens, hundreds, or thousands of decision trees, each built on a random sample of observations from the dataset and a random subset of the variables. No tree sees all of the data and all of the features, which makes the trees less prone to overfitting.
Each tree is a sequence of yes/no questions, each based on a single variable, e.g. "does the client earn less than 5,000 PLN?". Each question (tree node) splits the data into 2 segments, each containing observations that are more similar to one another and different from those in the other segment. For classification, the impurity measure is Gini impurity or entropy. While a tree is being trained, one can therefore compute by how much each feature decreases impurity. The more a feature decreases impurity, the more important it is. In random forests, the impurity decrease attributable to each feature can be averaged across trees to determine the final importance of the variable.
Put simply, variables selected near the top of the trees are generally more important than those selected near the leaves, because the top splits lead to a larger information gain.
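The impurity decrease that underlies this importance measure can be worked through by hand. The sketch below assumes Gini impurity for a binary target (1 = bad, 0 = good) and compares two hypothetical splits of the same parent node; the function names and the toy labels are illustrative only.

```python
def gini_impurity(labels):
    """Gini impurity of a node with binary labels (1 = bad, 0 = good)."""
    n = len(labels)
    if n == 0:
        return 0.0
    p_bad = sum(labels) / n
    return 1.0 - p_bad ** 2 - (1.0 - p_bad) ** 2


def impurity_decrease(parent, left, right):
    """Parent impurity minus the size-weighted impurity of the two child nodes."""
    n = len(parent)
    weighted = len(left) / n * gini_impurity(left) + len(right) / n * gini_impurity(right)
    return gini_impurity(parent) - weighted


# Parent node: 6 good and 4 bad clients
parent = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]

# Split A concentrates the bad clients in the right child
decrease_a = impurity_decrease(parent, [0, 0, 0, 0, 0, 1], [0, 1, 1, 1])

# Split B leaves both children with the same mix as the parent
decrease_b = impurity_decrease(parent, [0, 0, 0, 1, 1], [0, 0, 0, 1, 1])

print(f"parent impurity: {gini_impurity(parent):.2f}")
print(f"decrease for split A: {decrease_a:.4f}")
print(f"decrease for split B: {decrease_b:.4f}")
```

A variable that produces splits like A accumulates a large total decrease across the forest, while one that only produces splits like B ends up with an importance close to zero.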
Data preparation¶
Handling missing values in quantitative variables¶
Most algorithms require missing values to be handled. Here, however, the information that a value is missing should be preserved, because it may itself be informative; later, during discretization, it will let us isolate these clients as separate groups. For quantitative variables I will therefore look for a unique number that can stand for missing data (-99999).
# List of datasets to check
datasets = [df_train, df_test, df_all]
dataset_names = ['df_train', 'df_test', 'df_all']
# Tracks whether any column containing -99999 was found
found_minus_99999 = False
# Loop over each dataset
for dataset, name in zip(datasets, dataset_names):
    numerical_columns = dataset.select_dtypes(include=[np.number]).columns
    print(f"Number of occurrences of -99999 in numeric columns of {name} (only if > 0):")
    local_found = False
    # Print the count only for columns where -99999 occurs at least once
    for column in numerical_columns:
        count_minus_99999 = dataset[column].value_counts().get(-99999, 0)
        if count_minus_99999 > 0:
            print(f"{column}: {count_minus_99999}")
            found_minus_99999 = True
            local_found = True
    # Report when -99999 does not occur anywhere in the dataset
    if not local_found:
        print(f"No columns containing -99999 in {name}.")
Number of occurrences of -99999 in numeric columns of df_train (only if > 0):
No columns containing -99999 in df_train.
Number of occurrences of -99999 in numeric columns of df_test (only if > 0):
No columns containing -99999 in df_test.
Number of occurrences of -99999 in numeric columns of df_all (only if > 0):
No columns containing -99999 in df_all.
The number -99999 can therefore be used to mark missing data in quantitative variables.
for dataset in datasets:
    numerical_columns = dataset.select_dtypes(include=[np.number]).columns
    dataset[numerical_columns] = dataset[numerical_columns].fillna(-99999)
df_train.isnull().any().any()
False
df_test.isnull().any().any()
False
df_all.isnull().any().any()
False
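The reason for using a sentinel instead of, say, the median can be shown on a toy column: once missing values are replaced with -99999, a binning step can give them a bin of their own. This is only a sketch; the bin edges and labels below are arbitrary assumptions, not the ones used later in Statistica.

```python
import numpy as np
import pandas as pd

SENTINEL = -99999

# Toy column with two missing values (assumption)
toy = pd.DataFrame({"EXT_SOURCE_3": [0.10, np.nan, 0.55, 0.80, np.nan, 0.35]})

# The sentinel must not already occur among the real values
assert not (toy["EXT_SOURCE_3"] == SENTINEL).any()

filled = toy["EXT_SOURCE_3"].fillna(SENTINEL)

# The first bin (-inf, SENTINEL] catches only the sentinel, i.e. the former NaNs
edges = [-np.inf, SENTINEL, 0.33, 0.66, np.inf]
labels = ["missing", "low", "medium", "high"]
binned = pd.cut(filled, bins=edges, labels=labels)
print(binned.value_counts().sort_index())
```

Keeping "missing" as a separate class lets its WoE be estimated like that of any other bin, which is the point of preserving the information about missing data.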
Splitting the datasets into X and y¶
df_train.head(1)
| | TARGET | AMT_GOODS_PRICE | NAME_INCOME_TYPE | NAME_EDUCATION_TYPE | NAME_FAMILY_STATUS | DAYS_BIRTH | DAYS_REGISTRATION | DAYS_ID_PUBLISH | OCCUPATION_TYPE | REGION_RATING_CLIENT | REG_CITY_NOT_WORK_CITY | ORGANIZATION_TYPE | EXT_SOURCE_2 | EXT_SOURCE_3 | DAYS_LAST_PHONE_CHANGE | FLAG_DOCUMENT_3 | BUREAU_DAYS_CREDIT_MEAN | BUREAU_DAYS_CREDIT_MAX | BUREAU_DAYS_CREDIT_ENDDATE_MEAN | BUREAU_AMT_CREDIT_SUM_MEAN | BUREAU_AMT_CREDIT_SUM_SUM | BUREAU_DAYS_CREDIT_UPDATE_MIN | BUREAU_MOST_FREQ_CREDIT_ACTIVE | BUREAU_MOST_FREQ_CREDIT_TYPE | PREVIOUS_APPLICATION_AMT_ANNUITY_MIN | PREVIOUS_APPLICATION_DAYS_DECISION_MIN | PREVIOUS_APPLICATION_DAYS_DECISION_SUM | PREVIOUS_APPLICATION_RATIO_APP_TO_GET_MEAN | PREVIOUS_APPLICATION_RATIO_APP_TO_GET_MAX | PREVIOUS_APPLICATION_NAME_CONTRACT_TYPE_REVOLVING LOANS | PREVIOUS_APPLICATION_NAME_CONTRACT_STATUS_REFUSED | PREVIOUS_APPLICATION_NAME_CLIENT_TYPE_NEW | PREVIOUS_APPLICATION_CODE_REJECT_REASON_LIMIT | PREVIOUS_APPLICATION_NAME_YIELD_GROUP_XNA | PREVIOUS_APPLICATION_NAME_YIELD_GROUP_HIGH | PREVIOUS_APPLICATION_MOST_FREQ_PREVIOUS_APPLICATION_NAME_PAYMENT_TYPE | PREVIOUS_APPLICATION_MOST_FREQ_PREVIOUS_APPLICATION_NAME_GOODS_CATEGORY | PREVIOUS_APPLICATION_MOST_FREQ_PREVIOUS_APPLICATION_CHANNEL_TYPE | PREVIOUS_APPLICATION_MOST_FREQ_PREVIOUS_APPLICATION_PRODUCT_COMBINATION | POS_CASH_SK_DPD_MEAN | POS_CASH_APP_COUNT | INSTALLMENTS_PAYMENTS_IS_DELAYED_MEAN | INSTALLMENTS_PAYMENTS_PAYMENT_UNDERPAYMENT_MAX | INSTALLMENTS_PAYMENTS_PERCENTAGE_DELAYED | CREDIT_CARD_BALANCE_CNT_DRAWINGS_CURRENT_MEAN | CREDIT_CARD_BALANCE_CNT_INSTALMENT_MATURE_CUM_SUM | CREDIT_CARD_BALANCE_CREDIT_USE_RATE_MEAN |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 112500.00 | Working | Secondary / secondary special | Married | -14669 | -6152.00 | -4731 | Security staff | 2 | 0 | Business Entity Type 3 | 0.18 | 0.68 | -228 | 1 | -1317.00 | -151.00 | 325.33 | 395797.50 | 2374785.00 | -518.00 | Active | Consumer credit | 5099.58 | -2657.00 | -3698.00 | 1.06 | 1.25 | 0.00 | 0.00 | 1.00 | 0.00 | 0.00 | 0.00 | Cash through the bank | Mobile | Country-wide | POS mobile with interest | 0.00 | 3.00 | 0.04 | 0.00 | 10.00 | -99999.00 | -99999.00 | -99999.00 |
df_train.shape
(452394, 47)
# Split the data into the target (y), taken from the first column, and the features (X), taken from the remaining columns
X_train = df_train.iloc[:, 1:]  # Everything except the first column
y_train = df_train.iloc[:, 0]  # First column as the dependent variable
X_test = df_test.iloc[:, 1:]  # Everything except the first column
y_test = df_test.iloc[:, 0]  # First column as the dependent variable
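The positional split above relies on TARGET being the first column. A slightly more defensive variant, shown here only as a sketch, selects the target by name so that the result does not depend on column order. The helper `split_features_target` and the toy frame are hypothetical.

```python
import pandas as pd


def split_features_target(df: pd.DataFrame, target_col: str = "TARGET"):
    """Return (X, y) with the target selected by name instead of by position."""
    X = df.drop(columns=[target_col])
    y = df[target_col]
    return X, y


# Toy frame (assumption) in which TARGET is deliberately not the first column
toy = pd.DataFrame({
    "AMT_GOODS_PRICE": [112500.0, 76500.0],
    "TARGET": [0, 1],
    "DAYS_BIRTH": [-14669, -17875],
})
X_toy, y_toy = split_features_target(toy)
print(X_toy.columns.tolist())
print(y_toy.tolist())
```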
X_train.shape
(452394, 46)
X_train.head(5)
| | AMT_GOODS_PRICE | NAME_INCOME_TYPE | NAME_EDUCATION_TYPE | NAME_FAMILY_STATUS | DAYS_BIRTH | DAYS_REGISTRATION | DAYS_ID_PUBLISH | OCCUPATION_TYPE | REGION_RATING_CLIENT | REG_CITY_NOT_WORK_CITY | ORGANIZATION_TYPE | EXT_SOURCE_2 | EXT_SOURCE_3 | DAYS_LAST_PHONE_CHANGE | FLAG_DOCUMENT_3 | BUREAU_DAYS_CREDIT_MEAN | BUREAU_DAYS_CREDIT_MAX | BUREAU_DAYS_CREDIT_ENDDATE_MEAN | BUREAU_AMT_CREDIT_SUM_MEAN | BUREAU_AMT_CREDIT_SUM_SUM | BUREAU_DAYS_CREDIT_UPDATE_MIN | BUREAU_MOST_FREQ_CREDIT_ACTIVE | BUREAU_MOST_FREQ_CREDIT_TYPE | PREVIOUS_APPLICATION_AMT_ANNUITY_MIN | PREVIOUS_APPLICATION_DAYS_DECISION_MIN | PREVIOUS_APPLICATION_DAYS_DECISION_SUM | PREVIOUS_APPLICATION_RATIO_APP_TO_GET_MEAN | PREVIOUS_APPLICATION_RATIO_APP_TO_GET_MAX | PREVIOUS_APPLICATION_NAME_CONTRACT_TYPE_REVOLVING LOANS | PREVIOUS_APPLICATION_NAME_CONTRACT_STATUS_REFUSED | PREVIOUS_APPLICATION_NAME_CLIENT_TYPE_NEW | PREVIOUS_APPLICATION_CODE_REJECT_REASON_LIMIT | PREVIOUS_APPLICATION_NAME_YIELD_GROUP_XNA | PREVIOUS_APPLICATION_NAME_YIELD_GROUP_HIGH | PREVIOUS_APPLICATION_MOST_FREQ_PREVIOUS_APPLICATION_NAME_PAYMENT_TYPE | PREVIOUS_APPLICATION_MOST_FREQ_PREVIOUS_APPLICATION_NAME_GOODS_CATEGORY | PREVIOUS_APPLICATION_MOST_FREQ_PREVIOUS_APPLICATION_CHANNEL_TYPE | PREVIOUS_APPLICATION_MOST_FREQ_PREVIOUS_APPLICATION_PRODUCT_COMBINATION | POS_CASH_SK_DPD_MEAN | POS_CASH_APP_COUNT | INSTALLMENTS_PAYMENTS_IS_DELAYED_MEAN | INSTALLMENTS_PAYMENTS_PAYMENT_UNDERPAYMENT_MAX | INSTALLMENTS_PAYMENTS_PERCENTAGE_DELAYED | CREDIT_CARD_BALANCE_CNT_DRAWINGS_CURRENT_MEAN | CREDIT_CARD_BALANCE_CNT_INSTALMENT_MATURE_CUM_SUM | CREDIT_CARD_BALANCE_CREDIT_USE_RATE_MEAN |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 112500.00 | Working | Secondary / secondary special | Married | -14669 | -6152.00 | -4731 | Security staff | 2 | 0 | Business Entity Type 3 | 0.18 | 0.68 | -228 | 1 | -1317.00 | -151.00 | 325.33 | 395797.50 | 2374785.00 | -518.00 | Active | Consumer credit | 5099.58 | -2657.00 | -3698.00 | 1.06 | 1.25 | 0.00 | 0.00 | 1.00 | 0.00 | 0.00 | 0.00 | Cash through the bank | Mobile | Country-wide | POS mobile with interest | 0.00 | 3.00 | 0.04 | 0.00 | 10.00 | -99999.00 | -99999.00 | -99999.00 |
| 1 | 76500.00 | Working | Higher education | Married | -17875 | -8753.00 | -1427 | Laborers | 2 | 0 | Kindergarten | 0.38 | 0.83 | -2028 | 1 | -2914.00 | -2914.00 | -2549.00 | 45000.00 | 45000.00 | -2546.00 | Closed | Consumer credit | 8729.55 | -708.00 | -708.00 | 0.85 | 0.85 | 0.00 | 0.00 | 1.00 | 0.00 | 0.00 | 0.00 | Cash through the bank | Computers | Regional / Local | POS household with interest | 0.00 | 1.00 | 0.00 | 0.00 | 0.00 | -99999.00 | -99999.00 | -99999.00 |
| 2 | 675000.00 | Commercial associate | Higher education | Married | -10648 | -1208.00 | -980 | Managers | 3 | 0 | Transport: type 4 | 0.54 | 0.77 | -574 | 1 | -1110.00 | -566.00 | 816.75 | 1472400.00 | 7362000.00 | -485.00 | Closed | Consumer credit | 24996.96 | -574.00 | -574.00 | 0.86 | 0.86 | 0.00 | 0.00 | 1.00 | 0.00 | 0.00 | 0.00 | XNA | XNA | Credit and cash offices | Cash Street: low | 0.00 | 1.00 | 0.00 | 0.00 | 0.00 | -99999.00 | -99999.00 | -99999.00 |
| 3 | 405000.00 | Commercial associate | Secondary / secondary special | Single / not married | -19244 | -8562.00 | -1574 | Brak Danych | 2 | 0 | Housing | 0.23 | -99999.00 | 0 | 0 | -953.50 | -506.00 | 173.50 | 88823.25 | 355293.00 | -1124.00 | Active | Credit card | 30667.32 | -236.00 | -236.00 | 1.00 | 1.00 | 0.00 | 0.00 | 1.00 | 0.00 | 0.00 | 0.00 | Cash through the bank | Clothing and Accessories | Stone | POS industry with interest | 0.00 | 1.00 | 0.00 | 0.00 | 0.00 | -99999.00 | -99999.00 | -99999.00 |
| 4 | 675000.00 | Commercial associate | Secondary / secondary special | Married | -20437 | -12796.00 | -1119 | Managers | 1 | 0 | Business Entity Type 3 | 0.60 | 0.41 | -257 | 0 | -257.00 | -257.00 | 111.00 | 514003.50 | 514003.50 | -2.00 | Active | Consumer credit | 53148.42 | -257.00 | -514.00 | 1.00 | 1.00 | 0.00 | 1.00 | 2.00 | 0.00 | 0.00 | 0.00 | Cash through the bank | Furniture | Country-wide | POS industry with interest | 0.00 | 1.00 | 0.00 | 0.00 | 0.00 | -99999.00 | -99999.00 | -99999.00 |
y_train.shape
(452394,)
y_train.head(1)
0    0
Name: TARGET, dtype: object
Note that machine learning models in the sklearn library require all input features to be numeric. Rather than encoding my categorical variables, I will simply run feature selection on the numeric variables only, since there are far more of them.
There are currently 11 categorical variables; I will decide which of them to drop at the variable discretization stage. I would like about 5 of them to remain.
# Select the numeric columns
X_train_numeric = X_train.select_dtypes(include=['int64', 'float64'])
X_test_numeric = X_test.select_dtypes(include=['int64', 'float64'])
Training the model and assessing feature importance¶
from sklearn import preprocessing, ensemble
import plotly.graph_objects as go
import plotly.offline as py
from sklearn.ensemble import RandomForestClassifier
# Build and train a RandomForest model on the numeric data only
# (no random_state is set, so the importances can vary slightly between runs)
model_numeric = RandomForestClassifier(n_estimators=1000, max_depth=10)  # 1000 trees; importances are averaged across trees
model_numeric.fit(X_train_numeric, y_train)
# Get the feature importances
importances_numeric = model_numeric.feature_importances_
# Build a DataFrame with the importances of the numeric features
df_numeric_importance = pd.concat([
    pd.DataFrame(X_train_numeric.columns, columns=['feat']),
    pd.DataFrame(importances_numeric, columns=['importance'])
], axis=1).sort_values(by='importance', ascending=False)
# Display the variable importances
print(df_numeric_importance)
| | feat | importance |
|---|---|---|
| 6 | EXT_SOURCE_2 | 0.22 |
| 7 | EXT_SOURCE_3 | 0.20 |
| 1 | DAYS_BIRTH | 0.05 |
| 10 | BUREAU_DAYS_CREDIT_MEAN | 0.04 |
| 29 | INSTALLMENTS_PAYMENTS_IS_DELAYED_MEAN | 0.04 |
| 34 | CREDIT_CARD_BALANCE_CREDIT_USE_RATE_MEAN | 0.03 |
| 8 | DAYS_LAST_PHONE_CHANGE | 0.03 |
| 0 | AMT_GOODS_PRICE | 0.03 |
| 17 | PREVIOUS_APPLICATION_DAYS_DECISION_MIN | 0.03 |
| 11 | BUREAU_DAYS_CREDIT_MAX | 0.03 |
| 32 | CREDIT_CARD_BALANCE_CNT_DRAWINGS_CURRENT_MEAN | 0.02 |
| 22 | PREVIOUS_APPLICATION_NAME_CONTRACT_STATUS_REFUSED | 0.02 |
| 31 | INSTALLMENTS_PAYMENTS_PERCENTAGE_DELAYED | 0.02 |
| 12 | BUREAU_DAYS_CREDIT_ENDDATE_MEAN | 0.02 |
| 20 | PREVIOUS_APPLICATION_RATIO_APP_TO_GET_MAX | 0.02 |
| 3 | DAYS_ID_PUBLISH | 0.02 |
| 14 | BUREAU_AMT_CREDIT_SUM_SUM | 0.02 |
| 13 | BUREAU_AMT_CREDIT_SUM_MEAN | 0.02 |
| 18 | PREVIOUS_APPLICATION_DAYS_DECISION_SUM | 0.02 |
| 19 | PREVIOUS_APPLICATION_RATIO_APP_TO_GET_MEAN | 0.01 |
| 15 | BUREAU_DAYS_CREDIT_UPDATE_MIN | 0.01 |
| 2 | DAYS_REGISTRATION | 0.01 |
| 30 | INSTALLMENTS_PAYMENTS_PAYMENT_UNDERPAYMENT_MAX | 0.01 |
| 28 | POS_CASH_APP_COUNT | 0.01 |
| 16 | PREVIOUS_APPLICATION_AMT_ANNUITY_MIN | 0.01 |
| 4 | REGION_RATING_CLIENT | 0.01 |
| 9 | FLAG_DOCUMENT_3 | 0.01 |
| 33 | CREDIT_CARD_BALANCE_CNT_INSTALMENT_MATURE_CUM_SUM | 0.01 |
| 27 | POS_CASH_SK_DPD_MEAN | 0.01 |
| 26 | PREVIOUS_APPLICATION_NAME_YIELD_GROUP_HIGH | 0.01 |
| 23 | PREVIOUS_APPLICATION_NAME_CLIENT_TYPE_NEW | 0.01 |
| 5 | REG_CITY_NOT_WORK_CITY | 0.00 |
| 25 | PREVIOUS_APPLICATION_NAME_YIELD_GROUP_XNA | 0.00 |
| 21 | PREVIOUS_APPLICATION_NAME_CONTRACT_TYPE_REVOLVING LOANS | 0.00 |
| 24 | PREVIOUS_APPLICATION_CODE_REJECT_REASON_LIMIT | 0.00 |
The results are very sensible and appear to reflect the actual predictive power of the variables. I decided to keep only the TOP 20 variables by the Importance measure, plus 4 additional variables chosen by expert judgment: 'DAYS_REGISTRATION', 'POS_CASH_APP_COUNT', 'REGION_RATING_CLIENT', 'FLAG_DOCUMENT_3'. The TOP 20 is taken from the ranking after these 4 have been set aside, so 24 numeric variables remain in total.
# Keep only the 20 best columns and the 4 expert-selected ones, i.e. drop all the others
expert_cols = ['DAYS_REGISTRATION', 'POS_CASH_APP_COUNT', 'REGION_RATING_CLIENT', 'FLAG_DOCUMENT_3']
ranked = df_numeric_importance[~df_numeric_importance['feat'].isin(expert_cols)].sort_values(by='importance', ascending=False)
cols_with_low_importance = ranked['feat'].iloc[20:].tolist()
# Print the list of columns
print(cols_with_low_importance)
['BUREAU_DAYS_CREDIT_UPDATE_MIN', 'INSTALLMENTS_PAYMENTS_PAYMENT_UNDERPAYMENT_MAX', 'PREVIOUS_APPLICATION_AMT_ANNUITY_MIN', 'CREDIT_CARD_BALANCE_CNT_INSTALMENT_MATURE_CUM_SUM', 'POS_CASH_SK_DPD_MEAN', 'PREVIOUS_APPLICATION_NAME_YIELD_GROUP_HIGH', 'PREVIOUS_APPLICATION_NAME_CLIENT_TYPE_NEW', 'REG_CITY_NOT_WORK_CITY', 'PREVIOUS_APPLICATION_NAME_YIELD_GROUP_XNA', 'PREVIOUS_APPLICATION_NAME_CONTRACT_TYPE_REVOLVING LOANS', 'PREVIOUS_APPLICATION_CODE_REJECT_REASON_LIMIT']
# Drop these variables from all 3 datasets
df_train = df_train.drop(columns=cols_with_low_importance)
df_test = df_test.drop(columns=cols_with_low_importance)
df_all = df_all.drop(columns=cols_with_low_importance)
df_train.shape
(452394, 36)
df_test.shape
(61503, 36)
df_all.shape
(513897, 37)
35 explanatory variables remain, 11 of which are qualitative; these will still be examined closely and selected during discretization.
Variable discretization¶
# Dump the dataset to CSV so that it can be loaded into Statistica
#df_train.to_csv('df_train3.csv', index=False)
AT THIS STAGE THE VARIABLES WERE DISCRETIZED IN STATISTICA
Quantitative variables were binned into logical groups, and the categories of qualitative variables were aggregated into larger groups. Everything followed the good practices for this process, namely:
- each class should contain at least 5% of the observations and 5% of the bad loans
- WoE values should differ clearly between attributes
- the trend observed in the WoE values should agree with business knowledge.
When reducing the number of bins during discretization, it is very important to try to keep the IV values as high as possible.
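The size rule from the list above can also be checked programmatically once a binning has been proposed. The sketch below is not the Statistica procedure; it assumes that "5% of the bad loans" means each class must hold at least 5% of all bad loans in the sample, and it reports WoE using ln(good share / bad share). The helper `check_bins` and the toy classes A, B, C are hypothetical.

```python
import numpy as np
import pandas as pd


def check_bins(bins: pd.Series, target: pd.Series, min_share: float = 0.05) -> pd.DataFrame:
    """Per-class share of observations, share of bad loans, WoE, and a size check."""
    counts = pd.crosstab(bins, target)
    good, bad = counts[0], counts[1]
    report = pd.DataFrame({
        "obs_share": (good + bad) / (good + bad).sum(),
        "bad_share": bad / bad.sum(),
        "woe": np.log((good / good.sum()) / (bad / bad.sum())),
    })
    # A class passes only if it is large enough on both counts
    report["size_ok"] = (report["obs_share"] >= min_share) & (report["bad_share"] >= min_share)
    return report


# Toy binning (assumption): class C is too small and should be merged with a neighbour
bins = pd.Series(["A"] * 60 + ["B"] * 37 + ["C"] * 3, name="bin")
target = pd.Series([1] * 12 + [0] * 48 + [1] * 8 + [0] * 29 + [1] * 1 + [0] * 2, name="TARGET")

report = check_bins(bins, target)
print(report)
```

Classes flagged with `size_ok == False` are candidates for merging; after merging, the WoE column shows whether neighbouring classes still differ clearly.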
# Variables that I decided to remove from the dataset entirely during discretization in Statistica
columns_to_drop = [
    'PREVIOUS_APPLICATION_RATIO_APP_TO_GET_MAX',  # results for this variable are hard to interpret
    'CREDIT_CARD_BALANCE_CNT_DRAWINGS_CURRENT_MEAN',  # missing values are the vast majority, yet they do not separate bad and good clients well
    'CREDIT_CARD_BALANCE_CREDIT_USE_RATE_MEAN',  # same reason as above
    'BUREAU_AMT_CREDIT_SUM_SUM',  # weak separation
    'BUREAU_AMT_CREDIT_SUM_MEAN',  # weak separation
    'PREVIOUS_APPLICATION_DAYS_DECISION_SUM',  # not needed
    'PREVIOUS_APPLICATION_RATIO_APP_TO_GET_MEAN',  # variable is hard to interpret
    'NAME_FAMILY_STATUS',  # not understandable from a business perspective
    'BUREAU_MOST_FREQ_CREDIT_TYPE',  # the most frequent groups separate poorly
    'PREVIOUS_APPLICATION_MOST_FREQ_PREVIOUS_APPLICATION_NAME_PAYMENT_TYPE',  # same situation as above
    'PREVIOUS_APPLICATION_MOST_FREQ_PREVIOUS_APPLICATION_CHANNEL_TYPE',  # the largest groups do not separate very well
    'PREVIOUS_APPLICATION_MOST_FREQ_PREVIOUS_APPLICATION_PRODUCT_COMBINATION',
    'PREVIOUS_APPLICATION_MOST_FREQ_PREVIOUS_APPLICATION_NAME_GOODS_CATEGORY',
    'BUREAU_MOST_FREQ_CREDIT_ACTIVE'
]
# Drop these variables from all 3 datasets
df_train = df_train.drop(columns=columns_to_drop)
df_test = df_test.drop(columns=columns_to_drop)
df_all = df_all.drop(columns=columns_to_drop)
df_train.shape
(452394, 22)
df_test.shape
(61503, 22)
df_all.shape
(513897, 23)
21 explanatory variables remain. This is a suitable number for building scoring models, because besides being as accurate as possible, the models must also be interpretable and understandable from a business perspective.
#df_train.to_csv('df_train.csv', index=False)
#df_test.to_csv('df_test.csv', index=False)
#df_all.to_csv('df_all.csv', index=False)
df_train.head()
| | TARGET | AMT_GOODS_PRICE | NAME_INCOME_TYPE | NAME_EDUCATION_TYPE | DAYS_BIRTH | DAYS_REGISTRATION | DAYS_ID_PUBLISH | OCCUPATION_TYPE | REGION_RATING_CLIENT | ORGANIZATION_TYPE | EXT_SOURCE_2 | EXT_SOURCE_3 | DAYS_LAST_PHONE_CHANGE | FLAG_DOCUMENT_3 | BUREAU_DAYS_CREDIT_MEAN | BUREAU_DAYS_CREDIT_MAX | BUREAU_DAYS_CREDIT_ENDDATE_MEAN | PREVIOUS_APPLICATION_DAYS_DECISION_MIN | PREVIOUS_APPLICATION_NAME_CONTRACT_STATUS_REFUSED | POS_CASH_APP_COUNT | INSTALLMENTS_PAYMENTS_IS_DELAYED_MEAN | INSTALLMENTS_PAYMENTS_PERCENTAGE_DELAYED |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 112500.00 | Working | Secondary / secondary special | -14669 | -6152.00 | -4731 | Security staff | 2 | Business Entity Type 3 | 0.18 | 0.68 | -228 | 1 | -1317.00 | -151.00 | 325.33 | -2657.00 | 0.00 | 3.00 | 0.04 | 10.00 |
| 1 | 0 | 76500.00 | Working | Higher education | -17875 | -8753.00 | -1427 | Laborers | 2 | Kindergarten | 0.38 | 0.83 | -2028 | 1 | -2914.00 | -2914.00 | -2549.00 | -708.00 | 0.00 | 1.00 | 0.00 | 0.00 |
| 2 | 0 | 675000.00 | Commercial associate | Higher education | -10648 | -1208.00 | -980 | Managers | 3 | Transport: type 4 | 0.54 | 0.77 | -574 | 1 | -1110.00 | -566.00 | 816.75 | -574.00 | 0.00 | 1.00 | 0.00 | 0.00 |
| 3 | 0 | 405000.00 | Commercial associate | Secondary / secondary special | -19244 | -8562.00 | -1574 | Brak Danych | 2 | Housing | 0.23 | -99999.00 | 0 | 0 | -953.50 | -506.00 | 173.50 | -236.00 | 0.00 | 1.00 | 0.00 | 0.00 |
| 4 | 0 | 675000.00 | Commercial associate | Secondary / secondary special | -20437 | -12796.00 | -1119 | Managers | 1 | Business Entity Type 3 | 0.60 | 0.41 | -257 | 0 | -257.00 | -257.00 | 111.00 | -257.00 | 1.00 | 1.00 | 0.00 | 0.00 |